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0.  OVERVIEW 


The  Inverse  Dynamics  problem  consists  (loosely)  of  computing  the  motor  torques  necessary 
to  drive  a  mechanical  manipulator  through  a  specified  motion,  given  that  you  know  where  it 
should  go.  compute  what  you  have  to  do  to  get  it  to  go  there.  Recently,  efficient  recursive 
formulations  have  been  developed  using  both  Newton-Euler  and  Lagrangian  dynamics 
[16](29].  These  have  reduced  the  number  of  additions  and  multiplications  (for  n  joints) 
from  an  r(n')  dependency  (see  Table  0.1)  to  one  linear  in  the  number  of  joints,  0(n), 
requiring 


(150n  -18)  Mutts  ■  ( 131  n  is)  Addr.s  (Newton  Euler) 

multiplications  (Mult*)  and  additions  (Addns)  when  performed  serially. 

This  paper  investigates  the  high  degree  of  parallelism  inherent  in  the  calculations,  and 
presents  two  formulations  especially  suited  to  nighty  parallel  implementations  using  special- 
purpose  hardware  or  VLSI  devices.  Table  0.1  shows  the  improvement  over  serial  implemen 
tations  (Note  that  this  reflects  the  algorithmically  indicated  cost,  as  in  Hollerbach[16]:  see 
the  discussion  at  the  beginning  of  Section  III.) 

The  first  formulation  is  again  linear  in  the  number  of  joints,  but  reduces  the  real  time 
coefficients  by  almost  two  orders  of  magnitude  to 

(2n  -t-  3)  Mutts  -  (6 n  4-  7)  Addns  (Newton- Euler) 

The  second  formulation  shows  that  by  exploiting  a  novel  parallel  algorithm  developed 
below,  the  time  required  to  perform  the  calculations  increases  only  as  '(logfn)).  The  time 
dependencies  are 

(2,rlogj(n  +  1)]  5)  Mult)  -r  (6[log2(n  +  1)]  -f  10)  Addn >  (Newton-Euler) 

Either  formulation  is  susceptible  to  a  systolic  pipelined  architecture.  We  show  below 
that  the  basic  time  cycle  of  the  algorithm  is  1  Mult  +  3  Addns.  Thus,  after  the  first  complete 
set  of  joint  torques  had  emerged  from  the  pipeline,  successive  sets  would  appear  at  intervals 


of  four  floating  point  operations  (4  Flops).  This  yields  the  ability  to  rapidly  and  efficiently 
evaluate  a  large  number  of  alternatives. 

Section  I  contains  a  brief  introduction  to  the  problem  and  review  of  previous  work  in 
the  area. 

Section  II  explains  in  detail  the  notation  used  in  this  paper .  It  is  essentially  an  adaptation 
of  the  notation  used  by  Hollerbach[16]  and  Luh  et  al.[29],  which  in  turn  derives  from  the 
Denavit  Hartenberg[10]  convention  for  lower-pair  linkages  through  Uicker[43]  and  Kahn[19], 
The  reader  already  broadly  familiar  with  this  notation  should  at  least  review  Table  11.1  where 
the  notation  is  summarized. 

Section  III  explicates  the  first  approach  considered,  yielding  an  ^(n)  formulation  with 
greatly  reduced  coefficients  Luh  et  a!  [29]  give  the  recursive  form  of  the  Newton-Euler 
formulation  as  shown  in  Table  1.1.  Many  of  the  computations  associated  with  any  given 
joint  (node)  may  proceed  concurrently.  For  example,  the  computations  of  the  variables 
j.  (denoted  by  (*)  in  Table  1.1)  and  1-  (**)  do  not  interact  (given  that  w  .  ,  and  J 
have  already  been  computed)  and  hence  may  be  performed  at  the  same  time  by  different 
sub  processors  Additionally,  different  sub  expressions  of  the  same  variable  may  often 
be  computed  concurrently  by  different  sub  processors  Finally,  by  locally  pipelining  the 
recursive  variables  additional  speed  may  be  gained;  e  g  .  the  computation  of  ri  may  be 
started  before  the  computation  of  u,  has  finished.  (See  Tables  III. i  and  III. 2).  Essentially, 
this  formulation  arises  from  trying  to  compute  as  much  as  possible  as  early  as  possible,  and 
still  remains  within  a  basic  linear  structure. 

Section  IV  shows  the  derivation  of  an  C (log(n))  time  dynamics  formulation.  This  arises 
from  restructuring  the  fundamental  framework  within  which  computation  proceeds,  together 
with  a  corresponding  revision  to  the  recursive  equations.  The  linear  recursive  algorithm  is, 
conceptually,  a  formalism  for  beginning  at  the  manipulator  base  and  propagating  desired 
motion  outward  link  by  link  to  the  tip.  then  propagating  tip  environmental  forces  and  torques 
back  inward  link  by  link  to  the  base  (determining  the  needed  joint  motor  torques  along  the 
way).  (See  Figure  111.1).  In  contrast,  the  logarithmic  recursive  algorithm  is  a  mechanism  for 
recursively  propagating  desired  motion  (or,  forces  and  torques)  between  any  two  adjacent 
groups  of  links.  Conceptually  speaking,  the  propagation  of  desired  motion  from  base  to  tip 
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is  accomplished  by  grouping  together  adjacent  links  on  the  first  step  to  form  (n/2)  groups 
of  two  links  each,  then  on  each  succeeding  step  grouping  adjacent  pairs  of  groups  together 
until  after  [!og.,(n)l  steps  there  is  one  group  encompassing  all  links  (actually,  the  intermediate 
groups  are  also  formed).  It  is  analogous  to  summing  n  numbers  in  [log.,(n)j  steps  by  adding 
together  first  adjacent  pairs,  then  adjacent  pairs  of  pairs,  and  so  forth.  (See  Figures  IV.1 
and  IV. 2). 

A  synthesis  of  the  two  approaches  is  presented  in  Section  V.  The  techniques  of  Section 
III  for  exploiting  parallelism  within  a  node  are  applied  to  the  C(log(n))  time  structure  of 
Section  IV,  yielding  an  £(k>g(n))  formulation  with  reduced  coefficients. 

Ultimately  the  algorithm  must  be  expressed  in  hardware,  and  Section  VI  addresses 
a  few  words  to  potential  implementations.  The  principle  thrust  of  this  paper  lies  in  the 
analytic  formulations  above,  and  we  will  consider  hardware  only  to  the  exte'  .  jhowing 
that  physical  implementations  are  reasonable  and  feasible.  We  consider  tht  'a|  number 
of  processors,  buffering  of  intermediate  results,  and  internal  communication,  only  to  the 
extent  of  showing  that  the  requirements  are  reasonable. 

Construction  of  suitable  hardware  using  today's  technology  argues  for  a  special-purpose 
VLSI  chip,  and  Section  VII  presents  one  architecture  suitable  for  this  purpose.  A  primitive 
module  consisting  of  two  multipliers,  one  adder,  and  some  registers  is  sufficient  to  support 
all  of  the  computation  required.  Several  such  primitive  modules  may  then  be  assembled, 
together  with  a  suitable  control  structure,  to  produce  a  matrix  vector  arithmetic  module. 
These  may  then  be  connected  into  a  network  corresponding  to  the  communication  structure 
of  the  algorithm,  and  each  module  programmed  by  the  host  computer  to  execute  the 
operation  and  communication  sequencing  necessary  to  implement  the  algorithm  (or  for  that 
matter,  any  other  high-speed  straight-line  matrix-vector  computation). 

Finally.  Section  VIII  will  very  briefly  allude  to,  without  discussing  in  depth,  a  few  possible 
extensions  to  this  work.  The  ability  to  efficiently  pipeline  implies  that  a  number  of  considered 
variations  on  the  same  basic  manipulator  trajectory  could  be  explored  in  parallel,  and  the 
one  having  the  most  satisfactory  dynamical  characteristics  for  actual  execution  chosen  from 
among  that  set.  It  may  even  be  possible  to  build  an  on-line  “optimizing  trajectory  compiler" 
in  which  the  desired  motion  (trajectory)  for  the  next  several  time  periods  is  pre-planned, 
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the  motor  torques  automatically  generated,  and  the  time  sequence  inspected  slightly  before 
the  manipulator  has  actually  arrived  at  the  trajectory  points,  thereby  incorporating  some 
dynamical  considerations  into  trajectory  planning  Poggio  has  a  result  indicating  that  neural 
structures  could  perform  an  arithmetic  multiplication  in  about  a  millisecond,  which  implies 
that  in  principle  it  would  be  possible  to  compute  the  Inverse  Dynamics  in  approximately  real 
time  using  a  suitable  neural  structure.  We  close  with  a  few  remarks  concerning  generalization 
of  the  D(log(n))  embedding  to  other  recursive  algorithms. 
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Table  0.1  —  Comparison  of  Dynamics  Formulations 
(adapted  from  Hollerbach[16]) 


0.1  a  —  Comparison  of  Time  Dependencies 

Method 

Multiplications 

Additions 

Uicker/Kahn 
(original  Lagrangian) 

32  £n<  +  86  j^n3  +  171  $n2 
+  53  Jn  -  128 

25n4  +  66$n3  +  129$n2 
+42  Jn  —  96 

Waters 

(partially  recursive) 

106in2  +  620^n  —  512 

82nJ  +  514n  -  384 

Hollerbach 
(4x4  Lagrangian) 

830n  -  592 

675n  -  464 

Hollerbach 
(3x3  Lagrangian) 

412n  -  277 

320n  -  201 

Luh,  Walker,  Paul 
(Newton  Euler) 

150n  -  48 

131n-  48 

Horn,  Raibert 
(table  look-up) 

2trt  -t-  n- 

n;t  +  n-  -f  2 n 

Luh,  Lin 

(scheduled  parallel  N.E.) 

57 n  -  18 
(estimated  —  see  text) 

50n  -  18 
(estimated  —  see  text) 

Lathrop 

(linear  parallel  N.E.) 

In  +  3 

6n  4-  7 

Lathrop 

(logarithmic  parallel  N.E.) 

2[log2(n+  1)1  +  5 

6[log2(n  +  1)]  +  10 

Lathrop 

(systolic  pipeline) 

1 

(successive  —  see  text) 

3 

(successive  —  see  text) 

*  This  table  reflects  the  algorithmically  indicated  cost  for  the  fully  general  6-link  rotary  manipulator, 
as  in  Hollerbachjl6/.  By  considering  special  cases,  introducing  configuration  or  workspace  assump¬ 
tions,  or  tailoring  the  computation,  additional  reductions  are  possible.  See  the  discussion  at  the 
beginning  of  Section  Iff. 


-•-iHwm 


Table  0.1  —  Comparison  of  Dynamics  Formulations* 
(adapted  from  Hollerbach[16J) 


O.lb  —  Comparison  for  n  =  6 

Method 

Additions 

Uicker/Kahn 
(original  Lagrangian) 

66,271 

51,548 

Waters 

(partially  recursive) 

7,051 

5,652 

Hollerbach 
(4x4  Lagrangian) 

4,388 

3,586 

Hollerbach 
(3x3  Lagrangian) 

2,195 

1,719 

Luh,  Walker,  Paul 
(Newton  Euler) 

852 

738 

Horn.  Raibert 
(table  look-up) 

468 

264 

Luh,  Lin 

(scheduled  parallel  N.E  ) 

323 

(estimated) 

280 

( estimated ) 

Lathrop 

- -ll-l  M  C  \ 

tHIIWUI  ^/Ul  UMVI 

15 

43 

Lathrop 

(logarithmic  parallel  N.E.) 

11 

28 

Lathrop 

(systolic  pipeline) 

1 

(successive) 

3 

(successive) 

*  This  table  reflects  the  algorithmically  indicated  cost  Jor  the  fully  general  6-link  rotary  manipulator, 
as  in  Hollerbach(16j.  By  considering  special  cases,  introducing  configuration  or  workspace  assump¬ 
tions,  or  tailoring  the  computation,  additional  reductions  are  possible.  See  the  discussion  at  the 
beginning  of  Section  III. 
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1.  INTRODUCTION 


In  active  articulated  mechanisms,  including  both  artificial  and  biological  systems,  the 
parameters  which  one  typically  can  directly  control  are  the  forces  (in  translational  joints, 
sometimes  called  prismatic)  and  torques  (rotational  joints,  sometimes  called  revolute)  applied 
by  the  actuators  to  the  joints.  Unfortunately,  the  parameters  in  which  one  is  frequently 
interested  are  the  linear  and  rotational  accelerations  (hence  also,  velocities  and  positions). 
This  gives  rise  to  two  dual  problems;  both  highly  complex  and  non  linear,  and  both  desirable 
to  calculate  in  real  time. 

The  Direct  (or  Integral)  Dynamics  problem  is  to  compute  the  mapping  V  from  a  set 
of  applied  joint  forces  and  torques  (r,,  arising  from  stimulation  of  the  actuators )  into  the 
resulting  linear  and  rotational  joint  motions  (accelerations  ij,): 

-»  {?,}■ 

Computing  such  a  mapping  is  equivalent  to  simulating  the  motion  of  the  mechanism  under 
the  applied  actuator  effects.  In  this  case  one  knows  what  one  does  to  the  thing,  and  wishes 
to  find  out  where  it  will  go  in  response. 

The  Inverse  Dynamics  problem  is  to  compute  the  inverse  of  the  above  mapping;  given 
the  accelerations  desired,  find  the  forces/torques  necessary; 

£-’■{?,} 

Given  that  one  knows  where  one  wants  the  thing  to  go,  what  does  one  have  to  do  to  make 
it  go  there?  This  is  the  question  that  the  Inverse  Dynamics  seeks  to  answer.  “Where  one 
wants  it  to  go"  is  the  desired  trajectory,  the  manipulator  configuration  as  a  function  of 
time.  The  configuration  may  be  completely  specified  by  the  joint  positions,  so  the  trajectory 
may  be  given  by  stating  each  joint  position  as  a  function  of  time  (i.e.,  ?(f),  where  ?  is  an 
n  dimensional  vector  giving  the  actual  positions  ?,  of  the  n  joints).  These  functions  are 
assumed  to  be  twice  time-differentiable  to  provide  joint  velocities  and  accelerations  ( <j(t ), 

9(f))- 

The  question  is  usually  posed  by  giving  the  joint  positions  and  velocities  (g(<o).  <j(*o)) 
which  describe  the  state  of  the  manipulator  at  a  given  point  in  time  (say,  t0),  together  with 


the  joint  accelerations  which  are  desired  at  that  point  (?(t0))-  The  answer  expected  is  the 
n  motor  torques  (the  n-dimensional  vector  t(£0)  giving  the  motor  torques  t,(<0)  at  each  joint 
0  which,  if  applied  to  the  manipulator  at  that  point  in  its  state-space,  would  induce  the 
desired  accelerations.  Typically  one  measures  q  and  q,  while  q  is  supplied  by  a  higher-level 
planning  and  control  module.  Position  and  velocity  are  the  time  integrals  of  acceleration,  so 
acceleration  control  suffices,  at  least  in  a  “mathematically  exact”  sense.  There  are  a  host  of 
practical  problems  which  insure  that  the  model  and  the  reality  it  models  never  quite  match, 
and  which  we  shall  relegate  to  feedback.  For  control  purposes  the  formalism  provides  a 
good  first  approximation  to  the  "inverse  plant”,  which  is  close  enough  to  render  feedback 
correction  feasible. 

Computing  the  motor  torques  is  quite  complicated,  however,  due  to  the  high  degree 
of  non  linearity  inherent  in  rigid-body  rotational  mechanics.  The  torques  supplied  must 
compensate  for  the  inertia  of  the  manipulator,  gravitational  force,  the  Coriolis  and  centrifugal 
forces,  and  viscous  friction  at  the  joints.  Viscous  frictional  forces  often  depend  only  on  q, 
and  q  at  joint  i;  hence  they  are  susceptible  to  relatively  simple  correction  and  will  therefore 
be  ignored.  All  of  the  other  terms  vary  in  a  non -linear  fashion  depending  on  the  manipulator 
configuration  at  a  given  point  in  time:  additionally  the  Coriolis  and  centrifugal  forces  also 
depend  on  all  pairwise  products  (q,q,.  1  <  i,j  <  n)  of  joint  velocities. 

I  his  complicated  computation  has  until  recently  posed  a  bottleneck  in  on-line  con¬ 
trol  of  manipulators,  and  much  effort  has  been  expended  in  devising  more  time-efficient 
methods.  Typical  resonant  frequencies  of  many  mechanical  manipulators  is  around  10 Hz, 
so  the  computation  must  be  repeated  at  about  60Hz  or  faster[29].  Uicker[43]  and  Kahn[l9J 
derived  an  early  formulation  for  an  n-link  manipulator  based  on  the  Lagrange  equations. 
This  had  an  0(n4)  time  complexity  and  required  7.9  seconds  on  a  PDP  11/45  to  com¬ 
pute  the  torques  for  just  one  point  in  the  trajectory[29].  This  was  too  slow  for  on-line 
control.  Efforts  to  improve  this  time  have  either  explored  other  computational  algorithms 
[44].[45],[35],[30],[29],[16j,  made  simplifying  assumptions  [4], [35],  or  substituted  table  look¬ 
up  for  computation  [38],[1],{2], 

Since  only  the  Coriolis  and  centrifugal  terms  involve  pairwise  products  of  all  joint 
velocities,  a  common  simplification  is  to  just  ignore  them.  Unfortunately  this  works  well  only 
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at  low  velocities  where  the  product  terms  are  small.  During  fast  motion  the  Coriolis  and 
centrifugal  terms  may  dominate  the  computation  [29]  to  such  an  extent  that  attempts  to 
control  the  errors  by  feedback  require  excessive  corrective  torques[38]. 

Table  look  up  is  in  principle  the  fastest  method  of  obtaining  the  necessary  torques. 
If  one  could  index  a  table  of  size  0{n3)  using  the  3n  values  of  (9,,  q„  9,,  1  <  «  <  n), 
all  computation  could  be  replaced  by  memory  references.  This  extreme  was  explored  by 
Albus[1].[2].  Raibert[37]  reduced  the  table  size  by  proposing  a  table  indexed  by  position 
and  velocity  (with  acceleration  handled  separately),  and  this  was  further  refined  by  Raibert 
and  Horn[38]  to  a  table  indexed  only  by  position.  Nonetheless  for  fine  enough  division  of 
the  table  dimensions  the  size  required  remains  enormous,  filling  the  table  and  interpolating 
between  stored  values  present  problems,  and  the  table  is  valid  only  for  one  particular  load 
(e  g.,  mass  of  object  grasped)[t6]. 

Two  other  formulations,  not  reflected  in  Table  0.1  because  particularized  to  special 
cases  and  hence  not  directly  comparable,  deserve  mention.  Kane  and  Levinson[20]  discuss 
a  formalism  (Kane  s  Dynamics,  originally  developed  for  complex  spacecraft  control)  based  on 
generalized  coordinates  and  velocities  similar  to  the  Lagrangian  approach.  The  dynamical 
parameters  can  be  represented  explicitly  so  as  to  exploit  simplifications  arising  from 
manipulator  configuration  or  workspace  constraints,  though  the  computational  complexity 
therefore  reflects  both  configuration  and  workspace  assumptions.  They  analyze  the  Stanford 
arm  under  the  assumption  that  the  workspace  never  requires  the  second  joint  to  approach 
e,  ~  0°  or  9,  =  180°  (due  to  numerical  instabilities  there).  Hollerbach  and  Sahar[18]  present 
a  method  of  merging  the  inverse  kinematics  with  the  inverse  dynamics,  particularized  to  the 
case  of  a  robot  with  a  spherical  wrist.  Many  of  the  kinematic  parameters  generated  in  the 
inverse  kinematics  are  needed  in  the  inverse  dynamics,  and  additional  savings  arise  from 
the  simplification  of  assuming  a  spherical  wrist.  Other  special  cases  are  discussed  in  the 
beginning  of  Section  III. 

The  most  successful  formulations  of  the  general  inverse  dynamics  involve  recursive 
algorithms.  Waters[44]  first  presented  an  0(n2)  partial  recursive  form  of  the  Lagrange 
equations,  which  was  made  fully  linear  recursive  by  Hotlerbach[16].  Hollerbach  also  contains 
an  overview  of  contrasting  approaches  to  the  inverse  dynamics  problem,  to  which  the 
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interested  reader  is  referred  for  further  details.  Orin  et  al.[34]  first  presented  a  linear 
recursive  form  of  the  NewtonEuler  equations,  which  was  refined  by  Luh  et  al.[29]. 

We  have  considered  both  the  Newton  Euler  and  the  Lagrangian  formulations,  in  the 
form  presented  by  Luh  et  al.[29]  and  Hollerbach[16].  Silver[40]  has  shown  them  to  be 
fundamentally  equivalent,  differing  mainly  in  that  the  representation  of  angular  velocity  in 
the  Newton  Euler  equations  is  more  efficient.  Reflecting  this,  the  parallel  formulations  for 
both  formalisms  are  found  to  require  approximately  equal  parallel  time  to  compute,  but  the 
Lagrangian  formulation  would  require  substantially  more  hardware  to  implement.  This  paper 
will  thus  present  only  the  Newton  Euler  results.  Complete  details  for  the  Lagrangian  case 
may  be  found  in  Lathrop[25). 

The  underlying  intuition  in  both  cases  is  the  following.  The  motion  (by  which  we  will 
generally  mean:  position,  velocity,  and  acceleration)  of  the  manipulator  base  is  assumed 
known,  as  is  the  motion  of  each  individual  joint  By  beginning  at  the  base  and  accounting 
for  the  (known,  desired)  joint  motion  at  each  joint,  the  motion  of  each  successive  link  may 
be  cascaded  recursively  from  the  base  of  the  manipulator  to  the  tip.  Since  the  forces 
and  torques  applied  by  the  environment  to  the  last  manipulator  link  (the  tip,  workhead,  or 
end-effector)  are  known  or  measured,  and  the  motion  of  the  last  link  is  known,  the  force 
or  torque  at  the  last  joint  necessary  to  drive  the  last  link  through  its  desired  motion  may 
be  calculated  directly  from  free-body  rotational  mechanics  This  in  turn  allows  calculation 
of  the  forces  and  torques  transmitted  across  the  last  joint  to  the  next  inboard  neighbor 
link,  which  in  turn  allows  calculation  of  the  next  but  last  force  or  torque  from  rotational 
mechanics.  Continuing  in  toward  the  base  in  this  fashion,  and  accounting  for  the  forces 
and  torques  passed  across  each  joint,  the  force  or  torque  necessary  at  each  joint  to  drive 
each  link  through  its  desired  motion  may  be  calculated  in  linear  time.  The  required  motor 
force  or  torque  at  each  joint  is  then  the  component  of  force  or  torque  along  or  about  the 
joint  axis.  Thus  in  linear  time,  the  motor  torques  needed  to  support  a  desired  motion  may 
be  computed, 

A  number  of  practical  attempts  have  been  made  to  relieve  the  host  computer  of  some 
of  the  computational  burden  associated  with  manipulator  control.  The  principle  idea  is  that 
machine  management  should  be  performed  outside  the  main  computer  (CPU).  A  common 


strategy  is  to  use  a  single  microprocessor  to  control  the  robot,  and  the  host  to  control  the 
controller,  e  g.  [11], [13],  and  [23].  This  paper  will  not  consider  these  further  because  the 
parallelism  achieved  is  minimal,  and  they  do  not  address  arm  dynamics. 

Most  attempts  to  apply  parallel  processing  to  manipulator  control  have  involved  using 
microprocessors  to  servo  individual  joints,  and  have  not  attempted  to  include  dynamical 
considerations.  Typically,  this  involves  the  microprocessor  as  the  active  element  in  a  joint 
control  feedback  loop  with  sensors  to  monitor  the  error  (usually  of  joint  position).  Also, 
the  individual  joint  servos  for  each  joint  axis  are  usually  under  the  control  of  a  master 
microprocessor,  which  coordinates  their  actions  with  commands  from  the  host.  In  fact, 
usually  the  servo  microprocessors  will  not  communicate  with  each  other  directly  at  all.  Shin 
and  Malin[39]  discuss  one  control  strategy  for  this  general  approach. 

Acting  under  this  general  paradigm,  Cook  et  al.[9]  discuss  a  configuration  of  seven 
68000s  and  a  programmable  logic  controller,  designed  for  welding  underwater  pipelines. 
Kuo[24]  describes  an  arm  mounted  on  a  mobile  platform,  having  one  microprocessor  per 
motor  and  based  on  an  AIM  65  system.  Rafauli  et  al.[36]  control  a  modified  Ummate  2000, 
using  one  Intel  8748  to  servo  each  axis  based  on  positional  feedback  and  a  master  composed 
of  an  iSBC  86/12A  combined  with  an  iS8C  337.  Gupta[14]  advocates  several  advantages 
of  the  MK68000  for  similar  applications.  A  single-board  controller  for  a  pneumatic  drive 
arm,  using  one  microprocessor  at  each  axis  with  feedback  from  air  pressure  and  speed 
as  well  as  position,  is  presented  by  Goshorn[12],  Carlisle[8]  additionally  dedicates  several 
microprocessors  to  sub-tasks  such  as  sensor  monitoring  or  I/O,  though  his  interest  leans 
somewhat  more  to  computer  numerically  controlled  machines.  Distributed  manipulator 
control  schemes  for  servo-manipulators  used  in  nuclear  reactor  maintenance  are  described 
by  Besant[5]  and  Martin  et  al.[31  ].  The  OSU  Hexapod  (an  18  degree-of-freedom,  motor-driven 
walking  machine)  is  controlled  by  an  experimental  multiprocessor  consisting  of  five  LSI-1  Is 
[21].  These  are  reconfigurable  so  that  tree,  star,  and  loop  structures  can  be  simulated.  This 
multiprocessor  has  also  been  used  for  real-time  optimization  of  leg  tip  forces,  a  task  which 
is  not  strictly  parallel  in  nature. 

Mudge  and  co-workers  in  several  papers  outline  a  scheme  whereby  the  general  purpose 
computer  incorporates  attached  special-purpose  processors  for  real  time  numerically  inten- 
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sive  computations.  The  special  processor  proposed  is  a  single  chip  implementation,  which 
would  interpolate  between  set  points  from  the  host  and  compute  the  correction  torques  for 
each  joint  of  the  robot  arm,  replacing  their  current  PUMA  control  scheme  of  one  LSI- 11/02 
and  six  6503s  (26],  Though  this  particular  application  ignores  dynamic  coupling  between 
joints,  another  paper  [33]  proposes  a  scheme  to  unify  Resolved  Motion,  Gross  Motion,  and 
Fine  Motion,  based  on  the  Newton  Euler  formalism,  suitable  for  implementation  in  real-time 
on  their  processor.  The  processor  (called  by  them  NP,  the  Numerical  Processor)  is  described 
[32]  as  lying  between  Floating  Point  Systems'  AP120B  (a  high  performance  numerically 
oriented  attached  processor)  and  the  Intel  8087  (a  single  chip  numerically  oriented  attached 
processor  in  the  Intel  8087  family)  A  similar  approach  is  described  by  Turner  et  al  [42] 
involving  a  distributed  processing  architecture  using  three  microcomputers  in  a  pipelined 
configuration,  also  proposed  for  a  broad  class  of  advanced  manipulator  control  algorithms. 

In  the  work  most  closely  related  to  this  paper,  Luh  and  Lin  describe  a  procedure 
for  scheduling  the  sub  tasks  of  a  group  of  microprocessors  computing  the  Newton  Euler 
dynamics  [27], [28],  One  microprocessor  is  again  assigned  to  each  controlled  joint  axis, 
but  in  contrast  to  the  servo  based  approaches  above,  the  microprocessors  do  communicate 
and  arm  dynamics  are  explicitly  computed.  Each  microprocessor  computes  the  recursion 
variables  which  correspond  to  its  joint  axis.  Since  these  variables  (as  well  as  intermediate 
partial  resunsj  recursively  aepeno  on  eacn  otner  in  various  amerent  ways,  otten  some 
microprocessor(s)  will  be  idle  while  waiting  for  others  to  complete  pending  sub-tasks  and 
idle  time  must  be  included  in  the  schedule.  It  is  a  non  trivial  scheduling  problem  to  assign 
each  sub  task  of  each  microprocessor  a  specific  execution  sequence  which  minimizes  the 
global  computation  time.  Because  the  form  of  the  equations  (and  hence  the  sub-tasks  to  be 
performed)  for  a  joint  differs  depending  on  whether  that  joint  is  rotational  or  translational,  in 
general  each  new  manipulator  configuration  requires  a  new  rescheduling  of  sub-tasks. 

The  procedure  adopted  is  a  modified  branch-and-bound  search  through  the  space 
of  possible  sub-task  orderings,  terminating  when  the  minimum-time  ordering  has  been 
found.  Any  feasible  ordering  which  accomplishes  all  sub- tasks  is  first  found,  then  refined 
by  generating  and  comparing  alternatives  (other,  partial,  orderings  from  branch  (choice) 
points).  The  estimated  total  time  of  a  partial  task  ordering  is  its  partial  time  so  far  (including 


idle  time),  plus  the  time  for  all  its  remaining  sub  tasks  to  complete  if  idle  time  is  ignored. 
Since  this  is  guaranteed  to  be  an  underestimate  of  true  total  time,  branch -and  bound  search 
applies.  Each  partial  ordering  is  extended  until  either  its  estimated  total  time  exceeds  that 
of  the  minimum-time  feasible  ordering  so  far  found,  or  it  is  extended  to  a  complete  feasible 
ordering  of  less  total  time  and  so  becomes  the  new  minimum.  At  the  conclusion  of  this 
procedure,  the  path  of  minimum  true  total  time  is  the  optimum  schedule.  Though  by  its 
nature  this  procedure  is  not  subject  to  precise  analysis  of  the  computational  complexity  of 
the  resulting  schedules,  the  authors  report  a  concurrency  factor  of  2.64  on  the  Stanford 
arm.  This  estimated  factor  is  used  in  Table  0.1. 
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Table  1.1  —  Linear  Recursive  Newton-Euler  Formulation 
(after  Luh  et  al.  [29]) 


Newton-Euler  Backward  Recursion: 


Af(w,_i  -f  if  joint  I  rotational; 


(*) 


if  joint  t  translational. 


i  +  +  w,  _  ,  x  *,_i q  )  if  joint  t  rotational; 

w,  '  (*•) 


U7w,_, 


if  joint  t  translational. 


P,  ~ 


+  <0,  x  p\  +  u  x  (w,  x  *)  joint  *  rotational; 

+u,  x  4-  u  y  (lj,  x  p")  -4-  AT z  ,<j  -4  2u>  v  A1  ^  1 9,  if  joint  t  translational. 


2  NOTATION 


The  notation  is  based  on  that  used  by  Hollerbach  [16]  and  Luh  et  a(.  [29j  in  their  analyses  of 
the  linear  recursive  inverse  dynamics,  which  in  turn  derives  from  the  Denavit  and  Hartenburg 
[10]  convention  for  lower-pair  chains.  Coordinate  systems  are  fixed  in  the  body  of  each 
link  and  related  rotationally  by  3  X  3  matrix  coordinate  transforms  and  translationally  by 
distinguished  body-fixed  position  vectors.  The  notation  'v}  is  used  to  denote  the  vector  v: 
referred  to  coordinate  system  0,. 

It  is  worth  emphasizing  that,  with  few  exceptions,  ALL  vectors  in  this  paper  are  referred 
to  link  coordinates.  This  obviates  the  complexity  and  computational  overhead  of  transforming 
everything  to  base  coordinates.  Elsewhere  in  the  literature.  %  (denoting  the  vector  Vj 
referred  to  base  coordinates)  is  usually  abbreviated  simply  as  Since  we  will  almost 
never  refer  any  vector  to  other  than  its  own  link  coordinates,  we  adopt  instead  the  notation- 
simplifying  convention  that  v,  abbreviates  vJ ,  a  vector  referred  to  its  own  coordinate  system 

V  >  - 

The  links  of  a  mechanism  are  numbered  consecutively  1  to  n  from  base  to  tip,  with  link 
0  denoting  the  base  reference  frame.  There  is  another  fictitious  link  n  -  1  attached  to  the  tip 
when  convenient,  which  may  represent  the  object  grasped  or  account  for  environmentally 
applied  forces  and  torques  at  the  workpiece.  Attached  to  link  i  is  a  right  handed  orthogonal 
coordinate  system  0,  with  axes  (x,,y,,z,). 

Joints  (equivalently,  hinges)  occur  between  the  links.  Each  joint  is  denoted  by  the 
number  assigned  to  its  distal  (outboard)  link,  so  that  joint »  connects  links  i  —  1  and  t.  Joints 
may  be  either  rotational  or  translational,  but  may  have  only  one  degree  of  freedom.  Multiple 
degrees  of  freedom  at  a  joint  are  modeled  by  introducing  fictitious  links  having  zero  mass 
and  length. 

For  any  two  adjacent  coordinate  systems  0,  and  0,_,  there  is  an  orthonormal  rotation 
matrix  A,  mapping  vectors  whose  coordinates  are  referenced  to  0,  into  the  corresponding 
vectors  referenced  to  0,_i.  Note  that  this  is  a  pure  rotation.  The  coordinate  systems  are 
located  in  the  links  so  as  to  simplify  the  form  of  this  matrix.  The  orthonormal  basis  vectors 
of  0,  are  arranged  as  follows: 
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z  lies  along  the  positive  axis  of  joint »  +  1, 

x,  lies  along  the  common  normal  from  z,_t  to  z,, 

y, =  z,  x  x,  completes  the  right-handed  orthonormal  system. 


Since  the  joints  have  but  one  degree  of  freedom,  x„  z,,  and  z,_ ,  are  all  body  fixed  vectors 
in  0,  (i.e.,  have  a  fixed  direction  relative  to  the  body-fixed  coordinate  system).  The  rotational 
orientation  of  two  adjacent  systems  is  completely  described  by 

q,  is  the  angle  between  z,_,  and  z,  in  a  right-handed  sense  about  x,, 

0,  is  the  angle  between  x,_,  and  x,  in  a  right-handed  sense  about  z,_1. 

The  vectors  z._,  and  z,  being  both  fixed  in  0,,  a,  is  constant.  Since  z,_ ,  lies  along  the 
joint  axis  of  joint  i,  and  x._,  and  x.  are  fixed  in  0,  —  i  and  J,  respectively  and  both  normal  to 
z  0,  measures  the  relative  rotation  about  the  joint  axis  between  the  two  systems.  Thus 
if  joint  t  is  rotational  then  6 ,  is  the  joint  variable,  otherwise  9,  is  constant  (see  Figure  11.1). 

By  the  preceding  remarks,  the  rotation  matrix  A  corresponds  to  a  coordinate  rotation 
about  x,  by  an  angle  a,  (which  aligns  z,  and  z,_,)  followed  by  a  rotation  about  the  rotated 

z,  (  _  ,)  by  an  angle  9 ,  (which  aligns  the  other  two  pairs  of  basis  vectors).  If  the  first 

rotation  is  represented  by  the  matrix  <J>,  and  the  second  by  0,,  then 


’I 

0 

0 

*,  = 

0 

cob  a 

—  sin 

a, 

.0 

siu  u 

cos 

*. 

’cos 

8,  - 

-  sin  8, 

O' 

e,  = 

sin 

8, 

cos  8, 

0 

.  0 

0 

1. 

A,  =  Q,*, 

'cos  8,  —  sin  8,  cos  a,  sin  8,  sin  a,  ' 

=  sin  8,  cos  8,  cos  cr,  —  cos  8,  sin  a, 

.  0  sin  a,  cos  a, 

The  translational  orientation  of  two  adjacent  systems  is  completely  described  by 

a,  is  the  distance  between  the  origins  of  0,- 1  and  0,  measured  along  x(; 
a,  is  the  distance  between  x,_,  and  x,  measured  along  z,_ j. 

Since  x,,  z,,  and  z,_i  are  fixed  in  0,,  a,  is  constant.  ,  lies  along  the  joint  axis  of  joint », 
so  if  joint  t  is  translational  a,  will  be  the  joint  variable,  otherwise  a,  is  constant. 


The  following  link  i  vectors,  referenced  to  0,,  permit  a  convenient  specification  of  the 
translational  relationship  between  adjacent  coordinate  systems 

Pi  a  vector  to  origin  0,  from  0,_, 

r.  a  vector  to  the  center  of  mass  of  link  «  from  0, 

*i  =  p,  -f  r' ,  a  vector  to  the  center  of  mass  of  link  i  from  0,_ j. 

From  the  above  remarks  it  is  clear  that 

p,  =  s,Aj z,_!  4-  a,x, 

but  in  virtue  of  the  form  of  A,  -  0,<t>,  and  its  constituents,  we  have 

0,'  2,-  1  =2,-1 
A  [  z, _ i  =  4>,70,T2,_, 

which  is  body  fixed  in  '  Thus  p"  is  composed  of  a  part  a  x,  which  is  fixed  in  C.  and 
represents  (constant)  translation  normal  to  both  z  _  ,  and  z,,  together  with  a  part  s,  4> '  , 
whose  direction  is  fixed  in  J1  and  whose  magnitude  represents  translation  along  2._,  referred 
to  C.  ■  If  joint  i  is  rotational  then  s,  is  constant  and  so  therefore  is  p‘.  else  p'  incorporates 
the  result  of  translational  joint  motion  at  joint ». 

The  rotation  matrices  A,  may  be  cascaded  by  defining  the  rotation  matrix  'IV,,  which 
maps  'u„,  into  'vm.  Since  the  inverse  of  an  orthonormal  matrix  is  equal  to  its  transpose  it 
follows  that 


w.r1 

=  mT 

(A.+i 

...A; 

if »  <  j 

k 

if »  >  j 

the  superscript  T  denoting  transpose  in  either  matrices  or  vectors. 
From  the  above  it  is  clear  that 

°W0  =  l(o*, ).(%).(%)) 
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and 


'Wj  =  {°Wj)°W} 

-  ewtx‘wg 
=  [(•*,).(■»,).  (■*,)]- 

In  the  sections  on  logarithmic  recursion  it  will  be  necessary  to  introduce  another  notation 
for  consistency  with  the  formalisms  developed  there.  We  will  use  W,wj  to  denote  the  .napping 
from  Cj  to  0,-i,  so  that 

Also  in  the  sections  on  logarithmic  recursion  we  will  slightly  abuse  the  dot  notation 
for  vector  time  differentiation  (i.e  ,  J„..  and  p  , )  Elsewhere  in  the  literature  this  denotes 
differentiation  in  the  inertial  frame,  with  differentiation  in  a  rotating  Irame  indicated  by  '  or 
*.  We  will  take  ALL  terms  u„,,  to  denote  differentiation  of  u,.(  in  C  This  becomes 
equivalent  to  the  standard  notation  at  u0.,.  the  case  of  interest,  and  eliminates  proliferation 
of  '  or  *  superscripts  in  much  the  same  spirit  that  taking  v  eliminated  superscripts  of  0. 
This  is  explained  more  fully  in  Section  IV. 
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Table  11.1  —  Summary  of  Notation  Used 
We  define  the  following: 

the  mass  of  link  j, 

a  vector  to  the  center  of  mass  of  link  j  from  the  origin  0 , , 
a  vector  to  the  origin  Oj  from  the  origin  0,-t  , 
accounts  for  gravitational  acceleration,  g  if  j  =  0,  else  0, 

=  p‘  +  r*.  a  vector  to  the  center  of  mass  of  link  j  from  the  origin  C;_,, 

=  rj7”V 

the  angular  velocity  vector  of  link  ;, 

the  inertial  tensor  (with  respect  to  its  center  of  mass)  of  link  j, 

the  joint  generalized  variable  for  joint  j,  0  if  rotational  and  s  if  translational, 

the  joint  generalized  actuator  force  at  joint  j,  torque  if  rotational  and  force  if  translational, 


me  iuidi  iui«je  lexuuumy  gravity;  on  iimk  j. 


the  total  torque  on  link  j, 


constraint  force  (unknown)  exerted  on  link  j  by  link  j  —  1, 


constraint  torque  (unknown)  exerted  on  link  j  by  link  j  —  1, 
a  pure  rotation  matrix  mapping  vectors  in  0,  into  vectors  in  0,-u 
a  pure  rotation  matrix  mapping  vectors  in  Oj  into  vectors  in  0„ 

=  *-lw„ 

va.fc  differentiated  in  0tt- 1  and  referred  to  (?>. 


S.  PARALLELISM  WITHIN  A  NODE 


This  section  will  investigate  the  effect  of  exploiting  parallelism  within  a  node  (joint),  while 
remaining  within  the  general  linear  recursive  framework.  As  a  conceptual  aid  in  this  section 
we  assume  that  there  is  one  group  of  parallel  processors  for  each  joint  (node)  on  the  forward 
and  the  backward  recursion.  However,  since  only  one  node  is  active  in  the  computation 
of  any  given  variable  at  any  one  step,  and  all  nodes  are  identical,  an  implementation  could 
be  constructed  using  only  one  processor  group  by  connecting  the  output  back  to  the  input 
through  a  buffer.  Details  of  how  this  might  actually  be  done  are  explored  in  Section  VI.  If 
only  one  processor  group  is  used,  computation  of  one  set  of  joint  torques  must  be  completed 
before  the  next  can  begin.  Otherwise,  with  one  processor  group  for  each  joint  (node),  it  is 
possible  to  systolically  pipeline  successive  sets  of  joint  torques  at  intervals  of  1  Mult  +  3 
Addns.  or  4  Flops. 

The  linear  recursive  structure  of  the  Newton  Euler  computations  may  be  shown  as  a 
directed  graph  as  in  Figure  III.  1 .  The  essential  structure  of  the  algorithm  can  be  clearly 
seen  —  acceleration  is  propagated  outward  (incorporating  at  each  stage  the  next  joint 
acceleration)  and  forces  are  thereafter  propagated  inward  (allowing  calculation  of  joint 
torque  at  each  stage).  Nodes  represent  the  total  processing  associated  with  each  joint  in 
the  forward  or  backward  recursion,  and  directed  arcs  represent  data  dependencies.  It  is 
clear  that  reducing  the  computational  time  incurred  at  each  joint  (node)  would  imply  a  linear 
reduction  in  the  total  computational  time  (a  constant  factor  speed-up).  For  reasons  which 
will  be  explained  in  Section  IV,  a  double  subscript  is  used  in  Figure  III.  1 :  thus  (0,j)  on 
the  backward  recursion,  and  (j,n)  on  the  forward,  denotes  the  variables  output  by  nodt  j. 
Non  systolic  and  systolic  arrangements  are  indicated  in  Figures  III. 2  and  111.3. 

Tables  III.  1  and  III. 2  show  the  detailed  internal  structure  of  each  node  of  the  directed 
graph  of  Figure  111.1 .  The  computation  times  given  in  Table  HI.  1  reflect  the  times  represented 
by  the  indicated  operation.  In  general  we  follow  Hollerbach[16]  in  accepting  the  principle 
that  economies  in  computation  should  be  explicit  in  the  formulation  rather  than  implicit 
in  the  programming,  and  Table  0.1  reflects  his  analysis  for  the  fully  general  6-link  rotary 
manipulator,  It  should  be  noted,  however,  that  by  “hand-tailoring"  the  computation  to 
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account  for  special  cases  the  computational  cost  can  be  frequently  be  reduced  below  that 
shown  in  Table  0.1.  For  example,  if  the  particular  manipulator  configuration  is  such  that 
some  of  the  constant  angles  a,  are  multiples  of  5 ,  then  the  factors  coso,  and  sino,  become 
0  and  ±1  and  multiplication  by  the  rotation  matrix  A,  can  be  reduced  from  9  Mulls  and 
6  Addns  to  4  Mulls  and  2  Addns.  Since  y,  p‘  =  0,  vector  cross  products  involving  p* 
can  be  calculated  with  a  special  sub  routine  in  only  4  Mulls  and  2  Addns.  Several  other 
optimizations  are  possible,  e  g.  see  [18], [20], [28], [29].  Many  of  these  apply  to  the  Stanford 
arm,  which  therefore  has  a  substantially  lower  computational  cost  than  the  fully  general 
case.  Newton  Euler  dynamics  particularized  to  the  Stanford  arm  requires  only  308  Mults 
and  254  Addns  [28],  and  Kane's  dynamics  requires  only  646  Mults  and  394  Addns  [20].  A 
rotary  manipulator  particularized  to  a  spherical  wrist  and  non  spinning  base  requires  only 
448  Mulls  and  361  Addns  [18].  Our  analysis  captures  the  fully  general  case  with  time  bounds 
substantially  below  these,  and  in  any  event  the  intended  implementation  in  VLSI  argues 
strongly  for  a  regular  and  systematic  treatment  which  avoids  all  special  cases. 

The  times  shown  in  Tables  111.1  and  III. 2  reflect,  for  each  variable,  the  rotational  case 
only.  The  extension  to  the  translational  case  follows  directly  in  analogous  fashion,  and  the 
time  bounds  stated  remain  almost  exactly  the  same  with  only  minor  variation  in  the  constant 
term. 

Note  tnat,  while  on  either  the  forward  or  backward  recursion,  the  linear  coefficient  in 
the  total  time  cost  is  determined  by  the  time  required  to  propagate  the  recursion  variables 
through  a  single  node.  More  specifically,  this  is  the  interval  between  the  time  that  the 
recursion  variables  become  available  to  one  node  and  the  time  that  they  are  made  available 
to  the  next  node.  However,  nothing  constrains  all  of  the  variables  to  be  made  available 
at  the  same  time.  Since  the  different  recursion  variables  are  used  at  different  times  in  the 
computation,  relative  delays  are  acceptable  provided  that  each  variable  is  available  by  the 
time  that  it  is  required  in  the  computation. 

If  the  different  variables  become  available  to  a  node  at  staggered  times  (and  in  turn  are 
made  available  to  the  next  node  at  equally  staggered  times),  the  linear  time  coefficient  is 
determined  by  the  longest  time  required  to  propagate  any  single  variable  across  the  node. 
In  determining  this  time  only  «„  ii„  p„  /„  and  n,  need  be  considered,  as  the  other  variables 
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are  not  propagated  down  the  recursive  chain. 

Examining  Table  III.  1  for  the  Newton-Euler  formulation,  it  is  clear  that  (given  the  proper 
staggering  constant  offsets)  no  variable  on  either  the  forward  or  backward  recursion  requires 
longer  than  a  matrix-vector  multiplication  and  a  vector  addition  to  propagate  through  a  node. 
This  is  evident  because  each  propagated  variable  becomes  available  to  the  (»  +  1)—  node 
one  matrix  vector  multiplication  and  one  vector  addition  after  it  is  made  available  to  the  »— 
node.  Thus,  the  time  required  for  the  entire  dynamics  calculation  may  be  reduced  to 

In  [MV  +  VA)  +  C 

where  (AH')  and  (V'.-t)  are  respectively  the  time  required  to  perform  a  matrix-vector  mul¬ 
tiplication  and  a  vector  addition,  and  C  is  a  constant  which  accounts  for  initiating  the 
calculation 

This  is  the  case  even  though  each  individual  variable  may  take  longer  than  (MV  +  V A) 
to  compute.  Table  III. 2  illustrates  this  arrangement  graphically.  The  times  when  each 
of  the  variables  become  available  are  shown  in  Table  III. 2,  as  well  as  the  intermediate 
partial  results.  For  simplicity  of  presentation  in  Table  III. 2.  a  timing  model  is  used  in  which 
l  A/  u.'f  i  Addr.  1  Flop.  This  metric  will  also  be  used  to  establish  the  relative  times  at 
which  various  partial  results  are  computed  and  made  available. 

w,  depends  on  nothing  except  w,_,  and  ^,_l9,,and  requires  time  (MV-fVA)  to  compute 
once  these  are  available.  It  is  clear  that  successive  values  of  w,  will  become  available  at 
intervals  of  (MV  +  V A),  since  each  depends  upon  nothing  but  its  preceding  recursive 
variable  value  and  the  input,  w,  depends  on  w,_,  as  well  as  on  w,_,  and  z,_i<jt.  However, 

i j,_j  and  z . _ i g.  are  required  in  the  computation  well  before  w,_i.  Given  the  availability  of 

{w,}  at  intervals  of  (MV  -4-  VA),  any  w.-dependent  intermediate  partial  results  required  by 
w,  can  be  calculated  before  w,_i  becomes  available  (by  delaying  u,  if  necessary,  as  will 
be  seen  below).  These  intermediate  results  already  computed,  when  w,_,  does  become 
available  it  will  be  possible  to  compute  w,  in  time  (MV  -)-  VA).  Meanwhile  computational 
precursors  to  have  been  similarly  calculated,  so  that  when  u,  becomes  available  it  is 
then  possible  to  compute  w,+)  in  time  (MV  -f  VA)\  and  so  forth.  In  this  fashion,  it  can 
be  seen  that  the  availability  of  {w,}  at  intervals  of  (MV  +  VA)  implies  the  availability  of 
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{w,}  at  similar  but  offset  intervals.  By  a  similar  reasoning  process,  both  of  these  imply  the 
availability  of  {p, }  at  intervals  of  (MV  +  VA).  The  structure  and  timing  which  realize  this 
are  shown  in  Table  III. 2.  Similar  considerations  would  hold  for  the  forward  recursion. 

The  availability  offsets  and  the  constant  C  may  both  be  determined  from  Table  111.1.  This 
is  done  in  Appendix  A,  where  the  appropriate  offsets  are  shown  to  be 

Avati(w,_i)  >  Avai/(w,_i) -|-  VC  + VA  (*) 

Ava»l(p,_j)  >  At>a«7(w,_i)  -f  2 VC  +  3VA 
Avail(p,_x)  >  At»ai/(w,_,)  +  VC  +  2 VA. 

Avail(n,  +  I )  >  Avatl(f,  +  x)  +  VC  +  VA.  (***) 

The  three  equations  above,  (*),  {**),  and  (***),  define  the  constant  offsets  or  relative 
delays  by  which  the  propagation  of  the  Newton-Euler  recursion  variables  should  be  staggered 
in  order  to  achieve  the  stated  time  bound.  Note  that  if  computation  is  allowed  to  proceed 
on  a  scheme  whereby  a  node  operates  on  its  inputs  as  soon  as  the  data  becomes  available, 
then  these  offsets  will  be  naturally  set  up  and  maintained  by  the  inherent  computational 
delays  shown  in  Table  111.1. 

Next  we  determine  the  constant  J  by  showing  when  r, ,  the  last  generalized  joint  force 
of  the  forward  recursion,  becomes  available  as  output.  This  may  also  be  done  from  Table 
111.1,  and  details  are  also  shown  in  Appendix  A.  There  it  is  shown  that,  assuming  all  input 
values  become  available  simultaneously  at  time  t  —  0,  the  time  required  to  calculate  the 
Newton-Euler  dynamics  exploiting  maximal  linear  parallelism  is 

2n  •  (1  Mult  -|-  3  Addm)  +  (5  Multi  -f  9  Addm). 

Due  to  differences  (not  requiring  additional  computation  by  the  host)  in  assumptions  about 
the  form  of  the  input  and  output  (not  in  the  computation),  this  is  slightly  lower  (by  2  Multi  + 
2  Addm)  than  given  in  [25].  Specifically,  we  assume  that  the  input  buffer  will  deposit  q,  and 
q,  as  the  third  scalar  coordinate  behind  two  pre  stored  scalar  zeroes  so  as  to  make  available 
the  vectors  and  q,  directly,  and  that  the  output  buffer  will  directly  return  the  third 
scalar  coordinate  of  /,  (for  translational  joints)  or  n,  (for  rotational  joints)  as  the  indicated 
joint  force  or  torque.  This  input/output  convention  avoids  the  algorithmically  indicated  cost 
of  SV  and  VD  without  introducing  special  cases  into  the  computational  structure. 
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Table  III.  1  —  Relative  Time  ot  Linear  Data  Dependencies 


Linear  Backward  Newton-Euler  Recursion 

Var. 

Waits  Onf 

Time  at  Step  End 

Step 

Costf 

W| 

a  —  Ava*l(w,_i) 

Input 

T\  =  *i—i  j, 

•  #  # 

a  +  VA 

T2  =  oj. — i  +  T\ 

VA 

a  +  MV  +  VA 

£ 

II 

3 

MV 

;  w, 

a  =  Avatl(w,  —  \) 

Input 

T3  =  Z.-1?, 

•  •  • 

I 

b  —  Avaii(w,_i) 

a  +  VC 

T4  ~  ]  X  Tj 

VC 

a  -f  VC  4  VA 

t:,  =.-  r:,  +  r, 

VA 

j 

max(i>,  a  -r  VC  -tV  A) -*-VA 

T(,  =  T;,  -f  w,_1 

VA 

1 

max(6,a  -r  VC  +  VX) 

A- MV  +  VA 

HI 

p, 

c  —  Avatl(w,) 

c  +  VC 

TV  —  W,  X  p , 

VC 

i 

d  --  Ava\l(u>,) 

c  +  2  VC 

7*  '=  LI,  X  T7 

VC 

t  ~  Avail(jp,_ ,) 

d+VC 

Ti,  =  u>,  X  p* 

VC 

max(d  *■  VT,ct  2 VC)  +  VA 

T,u  —  7V  4-  Tg 

e  4  MV 

r,,  -  p,_, 

MV 

max(e  4  MV,d  f-  VC  4  V A, 
c  +  2 VC  4-  VA)  +  VA 

P,  --  T'l 0  4  T'i  1 

VA 

r, 

c  =  Avati(w,) 

c+VC 

r,2  =  w,  x  r; 

VC 

d  —  Avail[w,) 

c+  2VC 

T'1.1  =  <»>,  X  TJ2 

VC 

f  =  Avail(pt) 

d+VC 

Tm  =  iii,  X  r. 

VC 

max(d  +VC,c+  2  VC)  +  VA 

T’is  =  t\3  4-  rM 

VA 

mnx(f,d  +  VC  +  VA, 
c  +  2  VC  +  VA)  +  VA 

■ 

1 

VA 

t  See  end  of  table  (continued  next  page). 

1  "Auatt(X,_i)  =  t"  means  that  variable  X,—\  is  made  available  to  the  t—  node  at  time  t  (on  the 
Backward  recursion;  substitute  on  the  forward  recursion). 


Table  III.  1  —  Relative  Time  of  Linear  Data  Dependencies  (continued) 


Table  III. 2a  —  Timing  of  Linear  Newton-Euler  Backward  Recursion  (continued) 

Timing  of  p,;  n  =  4,  rotational  joints 


Pq  Pbate 


(For  simplicity  here,  1  Mull  =  1  Addn  —  1  Flop) 


6,  —  ui,  X  (w,  X  p*) 


7.  =  “i  X  P*  + 


p(  =  X  (w,  X  p, ) 

-t-w,  X  p*  -i-  A,'  p, 
=  7-  +  Ajp,_, 


wi  X  p, 


X  Pa 


wi  X  p, 


u>2  X  Pj 


w3XPj 


W3  X  p3 


W4  X  p4 


W4  x  P4 


Table  III. 2a  —  Timing  of  Linear  Newton-Euler  Backward  Recursion  (continued) 

Timing  of  rlt  F„  N,\  n  =  4,  rotational  joints 

(For  simplicity  here,  1  Mult  =  1  Addn  =  1  Flop) 


Table  III. 2b  —  Timing  of  Linear  Newton-Euler  Forward  Recursion 
Timing  of  /,,  n,;  n  =  4,  rotational  joints 

(A  =  fttp\  =  ntipt  Tt  =  [0 , 0,  l]  ■  n, 

(For  simplicity  here,  1  Mult  =  1  Addn  =  1  Flop) 
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Figure  III.  1  —  Linear  Recursive  Graph  Structure 


Physical  Structure: 

^11^1  A 


(Time  Intervals  2 n(MV  +  VA)  +  C  per  complete  set  of  joint  torques) 
(Only  one  physical  processor  per  variable  —  see  also  Figure  VI. 1.) 

Figure  III. 2  —  Non-Systolic  Pipelined  Process,  n  =  4 


(Time  Intervals  1  (MV  +  Va)  per  complete  set  of  joint  torques) 

(" Bkwd =  backward  recursion  processor  for  joint »,  “Fwd,"  *  forward) 
Figure  Ml. 3  —  Systolic  Pipelined  Process,  n  =  4 


4.  PARALLELISM  EXPLOITING  LOGARITHMIC  RECURSION 


The  preceding  section  showed  that  parallelism  potentially  leads  to  considerable  time  savings 
even  within  a  linear  recursive  framework.  This  section  will  show  that  the  linear  time 
dependency  can  be  improved.  By  a  suitable  restructuring  of  the  basic  computational 
framework  within  which  the  nodes  are  embedded,  together  with  a  corresponding  revision  to 
the  recursive  forms  of  the  equations,  an  5(k>g(n))  total  time  may  be  achieved.  For  convenient 
reference,  the  formulae  are  collected  in  Table  IV.1. 

The  basic  intuition  is  illustrated  in  Figures  IV. 1. a  and  b.  applied  to  n  consecutive 
multiplications.  If  these  are  performed  serially,  as  in  Figure  IV.1. a.  then  time  ' (n)  is  required. 
By  processing  in  parallel  (Figure  IV.I.b).  time  '(log(n))  may  be  achieved.  It  is  easy  to  see 
that  general  recursive  calculations  of  the  form 

x,  =  a,z,_,  4-  b, 

may  be  performed  in  time  2(log(n)). 

To  exploit  this  potential  in  the  inverse  dynamics  case,  however,  it  is  necessary  to 
generalize  the  linear  recursive  equations.  The  linear  form  may  be  regarded  as  an  operator 
©  which  maps  a  variable  (A".  _ i )  representing  the  base  through  (i  —  l)-'--  inputs,  together 

with  the  i—  input  (f  )  intn  (V  )  roorpsontmo  tho  haco  through  iti  inputs- 

-  X.. 

For  logarithmic  recursion,  an  operator  ®  is  required  which  maps  a  variable  (Xa>)  repre¬ 
senting  the  a—  through  k'- i  inputs,  together  with  (X(*+1)it,)  from  the  (k  4-  1)^  through  b^ 
inputs,  into  (Xa,6)  representing  inputs  a  through  6: 

Xa,k  ®  X|jt  +  i),|,  *-»  Xa.b- 

The  linear  recursive  equation  is  a  special  case  of  the  more  general  logarithmic  form  in  which 
o  =  0,  k  =  i  —  1,  6  =  k -f- 1  =  «,  A,,,  =  /,,  and  computation  proceeds  serially,  so  that 

X ,  =  X0,, 

—  Ao,(i  —  i)  ®  Ai.t 

=  Xl_1©/„ 
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This  is  the  meaning  of  the  double  subscripts  in  Figure  111.1 . 

What  physical  meaning  can  be  assigned  to  this?  We  will  return  to  this  question  during 
the  analysis  below,  where  the  discussion  can  be  illustrated  more  clearly  by  reference  to 
"real"  physical  quantities.  For  the  present,  however,  the  recursive  forms  (both  linear  and 
logarithmic)  may  be  thought  of  as  mechanisms  for  relating  physical  parameters  at  one 
coordinate  system  (or  link  or  joint)  to  physical  parameters  at  another,  by  abstracting  away 
the  intervening  links  of  the  physical  manipulator.  Very  loosely  speaking,  the  goal  in  both 
cases  is  to  relate  the  acceleration  of  each  link  to  the  acceleration  of  the  base  by  abstracting 
away  the  intervening  joint  accelerations,  then  to  relate  the  distal  forces  and  torques  acting 
on  each  link  to  the  distal  forces  and  torques  acting  on  the  tip  by  abstracting  away  the 
intervening  joint  forces  and  torques  This  done,  the  joint  forces  and  torques  necessary  to 
sustain  the  desired  acceleration  may  be  found  from  a  purely  local  application  of  Newtonian 
mechanics.  We  will  use  the  term  "relational  parameter"  to  refer  to  a  quantity  which  relates 
physical  parameters  at  one  coordinate  system  to  physical  parameters  at  another. 

The  linear  and  the  logarithmic  recursive  forms  differ  principally  in  how  they  approach 
the  problem  of  relating  physical  parameters  between  coordinate  systems  (or  links  or  joints). 
The  linear  form  relates  the  base  (backward  recursion)  parameters  to  the  first  link  parameters 
to  obtain  relational  parameters  of  the  form  A',  (s  AY,)  These  in  turn  are  related  to  the 
second  link  parameters  to  obtain  relational  parameters  of  the  form  X2  (=h  AY::),  which  relate 

the  second  link  to  the  base.  In  sequence,  the  relational  parameters  A03,  Au.., . X0.„,  are 

formed,  which  relate  respectively  the  third,  the  fourth,  and  the  n—  links  to  the  base.  The 
process  may  be  viewed  as  one  which,  at  each  step,  "glues"  the  next  link  parameter  onto 
the  current  relational  parameters  to  produce  the  next  relational  parameter,  thereby  relating 
the  base  to  the  next  successive  link. 

In  contrast,  the  logarithmic  recursive  form  may  be  viewed  as  a  mechanism  for  "gluing 
together"  any  two  adjacent  relational  parameters.  Parameters  of  the  form  A't>,  reflect  only 
the  input  values  at  link  (joint)  i;  those  of  the  form  X,.,  reflect  (abstract  away)  links  »  through 
j.  At  the  first  step,  adjacent  pairs  of  (backward  recursion)  relational  parameters  of  the  form 
AY, 2l,  AY_(_i,2,-m  are  related  in  parallel  to  form  the  relational  parameters  XY.-it+»-  Thus  on 
the  first  step  we  relate  alternating  pairs  of  adjacent  links  to  form  the  relational  parameters 
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JV'u. i ,  X4,i . (if  n  odd).  At  each  j-'-  succeeding  step,  all  adjacent  pairs  of 

the  form  Xa%k,  Xk+UL  are  related  in  parallel  to  form  the  relational  parameters  Xllib,  where 
a  =  2'm,  *  =  a  +  2J-2,  k<b<a  +  2J  —  l,  and  0  <  m  <  n/(2-').  On  the  second  step, 

X0,2,  X0,3l  X4S<  X4J . Xn  — 3,n — i ,  X„-3,n  are  additionally  formed;  on  the  third  step  we 

also  pick  up  X0.t,  X0,5,  X0i6,  X0j,  X8ii2l  . . .,  Xt,-?.,,-!,  X„_7,n;  and  so  forth.  This  process 
is  illustrated  in  schematic  in  Figure  IV.2  (only  the  backward  recursion  is  shown).  There, 
n  =  7,  so  the  backward  recursion  would  take  three  steps  as  shown.  It  is  easy  to  see  that  in 
^og.,(n+  1)]  steps,  all  (backward  recursion)  relational  parameters  of  the  form  X0.,,  0  <  i  <  n, 
may  be  formed.  But  as  noted  above,  these  are  just  the  linear  recursive  variables  Xt.  Thus 
the  backward  recursion  may  be  performed  in  real  time  in  ^(log.,(u  -+  1))  steps  using  parallel 
computation.  The  same  is  true  of  the  forward  recursion,  hence  of  the  Inverse  Dynamics 
computation. 

The  logarithmic  recursion  operator  must  possess  the  following  recursive  properties: 

(a)  Xu,b  must  be  computable  only  from  inputs  a  through  b, 

(b)  X„,b  =  X„,a  ®  X(,  + must  be  computable  from  variables  of  the 
form  Yj. a  or  y(A  +  ,,.,.  or  previously  computed  Z...i  or  non-recursive  values, 
and 


(c)  X0.;  on  the  backward  recursion,  or  X,.„  on  the  forward  recursion, 
must  be  equal  to  the  value  of  the  linear  recursive  variable  X,. 

These  properties  will  allow  the  use  of  a  structure  analogous  to  Figure  IV.2. 

The  remainder  of  this  section  will  be  devoted  to  the  derivation  of  appropriate  logarithmic 
recursive  formulae.  Separate  sets  of  formulae  will  be  developed  for  the  forward  and  the 
backward  recursion  variables.  As  in  Section  III,  it  is  necessary  to  consider  only  those 
variables  which  are  propagated  the  length  of  the  recursive  path.  For  the  Newton-Euler 
formulation  these  are  w,,  w,,  p,,  /,,  and  n,.  These  formulae  are  collected  in  Table  IV.  1. 

Suppose  one  wished  to  find  the  value  of  non  propagated  values,  for  example  to  find  r,„ 
for  node  m  on  the  forward  recursion  path.  One  first  computes  w0.m,w0 ,.n.  and  p0.m  using  the 
logarithmic  forms  given  in  Table  IV.  1.  By  the  remarks  above,  these  are  exactly  wm,  un,  and 
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p,„  of  Table  1.1;  and  the  time  needed  here  to  compute  all  three  is  0(log(m)).  From  these,  r„. 
may  be  computed  in  constant  time  C(c)  using  the  formulae  in  Tables  1.1  and  lll.i. 

In  the  derivations  below,  for  each  linear  recursive  variable,  our  general  strategy  will  be 
as  follows: 

(a)  propose  a  closed-form,  non  recursive  formula  which  is  equivalent 
to  the  linear  recursive  formula  of  Table  1.1, 

(b)  show  that  (a)  is  correct  by  showing  that  it  is  a  fixed  point  of  the 
linear  recursive  formula  considered,  and  holds  for  the  zero  term  (this  is 
equivalent  to  an  inductive  proof) 

(c)  propose  a  non  linear  recursive  formula  which  is  equivalent  to  (a) 
at  a  =  o  and  b  =  »,  and 

(d)  show  that  the  resulting  combining  operator  ®  is  correct  by 
showing  that  it  preserves  the  form  of  the  formula  in  (c). 

The  reader  may  verify  that  for  a  —  0,  k  =■ •  «'  —  1,  6  =  k  +  1  =*,  t  the  formulae  reduce  to  Table 

1.1. 

It  will  be  necessary  to  introduce  several  auxiliary  variables  not  directly  corresponding  to 
any  variable  in  the  linear  recursive  formalism.  There  we  will  be  concerned  to  show  only  that 
the  asserted  combining  operator  is  correct. 

In  the  following  we  assume  a  <  k  <  b  throughout,  the  case  of  o  =  b  being  found  as  a 
special  case  of  (c)  above.  In  order  to  cover  both  the  rotational  and  translational  cases,  it  is 
convenient  to  introduce 

11  ,  joint  I  rotational, 

0  ,  joint  I  translational. 


NEWTON-EULER  BACKWARD  RECURSION  VARIABLES: 


Introduce  the  auxiliary  variable  W„.,„  which  will  represent  products  of  the  coordinate 
matrices  A. .  Notation  sometimes  used  elsewhere  in  the  literature  is  "-l  WO,,  but  we  write  WJib 
here  for  consistency  with  other  variables.  Let 

b 

w-* 


-  w.,.kw,k+»,h 


It  can  readily  be  seen  that  M',,f  maps  the  coordinates  of  a  vector  expressed  in  the  system  0t. 
into  coordinates  expressed  in  The  physical  significance  of  the  combining  form  0  is 

to  compose  a  mapping  which  relates  0>  and  C-.  with  a  mapping  which  relates  Ok  and  <D„_ 1 
in  order  to  produce  a  mapping  which  relates  the  coordinate  systems  0,  and  0„- 1.  Thus 
Wa.„  =  A„  and  W„ =  /  (the  identity  matrix). 

In  the  case  of  W«,h,  the  combining  operator  0  is  matrix  multiplication  and 


=  Wa.kWk  +  1.6 


Hereafter  we  will  merely  display  the  combining  form  without  drawing  explicit  attention  to  the 
operator  0,  acknowledging  implicitly  that  X„ hs  Xa,* 

Next  consider  the  angular  velocity  u>,.  We  define  *(_i)  to  be  the  axis  of  rotation  of  the 
base  system  and  g0  to  be  its  magnitude,  so  that  w0  =  A%z (-i)q0-  This  will  be  non  zero  if,  for 
example,  it  is  desired  to  include  the  Earth’s  rotation  in  the  calculations  (perhaps  for  satellite 
applications).  If  one  creates  a  fictitious  frame  0(_i)  at  the  Earth’s  center,  then  )>  points 
through  the  North  Pole  and  q0  is  equal  to  the  Earth's  angular  velocity. 

We  show  that  ui,  satisfies  the  following  closed-form  non-recursive  formula 

X 

w,  =  *22 

J-0 
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because  this  is  a  fixed  point  of  the  recursive  formula  for  u>,  given  in  Table  1.1. 

t—i 

j  cr«*«— 19.) 

j-0 

=  -f  $,) 

In  order  to  produce  an  identical  form  at  A  =  0  and  6  =  *,  define 

6 

j— a 

*  6 

^  +  5Z  Wjj-OjZj  — 19; 

1  —  u  y^*  4- 1 

This  is  the  combining  form  for  w„ .fc. 

w,..;  expresses  the  angular  velocity  of  0,  relative  to  0„ _ ■ ,  referred  to  Ct-  Thus  wi.,  is 

the  angular  velocity  of  C  relative  to  the  base  frame,  and  w0.,  also  accounts  for  the  rotation 
of  the  base  frame  (if  ?„  -  0  then  these  are  the  same).  The  combining  form  for  w„,(,  is  a 
means  of  composing  the  angular  velocity  w*  +  M,  of  0,,  relative  to  0„  with  w,,.*,  that  of  On 
relative  to  in  order  to  obtain  the  angular  velocity  of  J.  relative  to  C,  In  particular, 

iv..,  =  0.  The  rotation  matrix  ,  in  the  formula  transforms  u„,s  from  0*  into  the 

system  to  which  wk+lif,  is  referred. 

Similar  remarks  apply  to  the  physical  significance  of  wa,h,  pa-(,.  etc.  Derivations  of  the 
other  propagated  recursion  formulae,  shown  in  Table  IV. 1,  are  given  in  Appendix  B.  Note 
that  w0  is  the  angular  velocity  of  the  base,  non  zero  if  the  Earth's  rotation  is  modeled  as 
in  some  satellite  applications.  Also,  p0  is  the  acceleration  of  the  base.  Typically  this  is  the 
acceleration  due  to  gravity  at  the  site.  If  one  took  w0  /  0  then  p0  may  also  include  a  term  for 
w0  x  (w0  x  Pq),  where  p'u  is  a  vector  from  the  Earth's  center  to  the  site;  this  accounts 
for  the  centripetal  acceleration  arising  from  the  rotation  of  the  Earth.  One  may  account  for 
gravitational  acceleration  by  taking  pi  =  9<  else  p?  =  0  for  i  ^  0.  The  term  involving  p ®  is  a 
technical  artifice  to  account  for  p0  cleanly,  and  vanishes  in  the  combining  form. 

As  mentioned  in  Section  II  on  notation,  we  slightly  abuse  the  dot  notation  for  time 
differentiation  (e  g.,  iia,6)  to  indicate  the  time  differentiation  of  ua.t.  from  within  the  coordinate 


system  0„_,.  This  is  equivalent  to  the  standard  notation  at  a  =  0  (the  case  of  interest),  and 
results  in  a  less  bulky  notation.  To  illustrate  this  usage,  we  consider  an  alternate  derivation 
of  wa.(,,  where  £  denotes  time  differentiation  in  0, ,  and  denotes  uJ-(,  referred  to  0C. 


ua,b  —  — ~Z —  Wa,l> 


dt 


[bu>a,k  + 


<tt 

b.  . 


—  ‘Va  ,*  "E 


dt 


<*>k  + 1,6 


—  Wj.  +  l  b<l)a  k  +  kVa.k  X  1  .fc 

—  Wj  +[  •+  (WT-t-l.l  u*.k  )  x  W*  +|,fc  +  W/,  +  I 


which  is  our  combining  form. 


NEWTON  EULER  FORWARD  RECURSION  VARIABLES 


On  the  Newton-Euler  forward  recursion,  the  coordinate  matrix  products  of  interest  will 
be  W„  r,.k.fl  instead  of  W/+l  This  is  because  we  wish  to  transform  from  ,  to  0„, 
instead  of  from  to  0b.  Intuitively  speaking,  we  are  now  working  from  the  tip  of  the 
manipulator  back  toward  the  base.  Hence  we  are  desirous  of  transforming  some  relational 
parameter  AT,  which  abstracts  the  subchain  from  coordinate  system  (or  link  or  joint) 
k  -4-  l  to  A  and  is  expressed  in  !),  .  . .  into  the  equivalent  quantity  expressed  in  q  The  V- .  . 
so  transformed  can  be  combined  with  Xa,k,  representing  the  subchain  between  a  and  k 
expressed  in  Ca.  to  yield  Xa.b  expressed  in  Ca.  We  will  here  assume  that  the  necessary 
products  tVa+li)+i  are  generated  on  the  forward  recursion  in  accord  with  the  formulae 
above.  This  requires  a  minor  abuse  of  notation,  as  the  combining  form  would  then  be 

Wo  +  l.b-f.)  =  +  ®  Wt-4-2,6+1 

but  we  have  agreed  that  our  combining  forms  will  be 

Xa,k  —  Xa,k 

If  the  reader  finds  this  troublesome,  we  suggest  that  she  or  he  make  the  substitutions 
a!  =  o+  l,  V  =  6-1-1,  and  k'  =  k  4- 1.  This  done,  one  need  only  remember  that 

Wa',o'  == 
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Table  IV. 1  —  Logarithmic  Recursive  Formulations 
Logarithmic  Backward  Recursion: 


(1  ,  joint  a  rotational, 

0  ,  joint  a  translational. 

9a  =  1  —  a  a 

A, i  if  a  =  6; 

if  a  b. 


Wa.t, 


jA 


w„.*  — 


Qa,  b  — 


if  a  —  6; 

if  a  ^  6. 

f.<K  *(„_i)tj0 

+  W{k+\).hu 

a.0  X  W()r  +  I).i 

&nAa  za — i  Qa 

if  a  =  6; 

Wjk+i)bQa.k  +  Q(*+i),(, 

if  a  ^  6. 

a  =  b 
a  jt-  b 
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Table  IV.  1  —  Logarithmic  Recursive  Formulations  (continued) 


Logarithmic  Backward  Recursion  (continued): 


• 

Po 

if  a  =  6; 

if  a  6. 

<*>....•  x  p* 

if  a  =  b; 

4  (^//l  4- 1  ).hwu 

.*•)  X  4-  S,*+n.t 

if  a  7^  6. 

Pa. I  = 


Pa  4-  wu.n  x  p*  +  «».,  X  (a/„.„  X  p’)  4-  4-  2w0.a  X  9„.a 

^(A4-|  ),*,Pu.A  4-  p,  *  +  ,).!.  4"  (H^n  I  )  x  /?(*.(.  1 ), I, 

4-(M,,^  +  |(.|.<Vj.*-)  X  ^(lV(rk  +  1)<tW0,t)  X  /?(*  4!).'  +  2(S<*4-1)>  4- 


Logarithmic  Forward  Recursion: 


r 

lW0+i,)[+i/(k-t.1),i)  4-  fa,k 


if  a  =  6; 
if  o  7^  b. 


^o,6 


IN*  4-  s‘a  X  Fa 

na.k  4-  Wa4-i,k4-ifn()k4.)),h  4-  x  /(*  +  i),t 


if  a  =  ft; 
if  a  b. 
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5.  OPTIMIZED  LOGARITHMIC  RECURSION 


Because  here  the  extension  to  the  translational  case  may  not  be  entirely  obvious,  the  equa¬ 
tions  and  analyses  presented  cover  both  rotational  and  translational  joints  in  manipulators 
composed  of  mixed  systems.  Analogously  to  Section  III,  as  a  conceptual  aid  we  assume  that 
there  is  one  group  of  parallel  processors  for  each  node  shown  in  Figure  IV.2,  and  also  one 
group  for  each  node  in  the  similar  forward  recursion.  However,  only  one  row  (tier)  is  active 
in  the  computation  at  any  one  step,  and  all  tiers  are  identical.  Thus  an  implementation  could 
be  constructed  using  only  one  physical  device  for  each  pair  of  joints  of  the  manipulator, 
by  connecting  the  outputs  back  to  the  inputs  through  a  buffer.  Implementation  details  are 
expanded  further  in  Section  VI.  The  comments  about  the  4  Flop  systolic  pipeline  interval, 
found  in  the  introduction  to  Section  III.  also  apply  here. 

The  detailed  internal  structure  of  a  node  is  presented  in  Table  V.1,  which  is  analogous  to 
Table  III .  1 .  Note  that  the  formulae  presented  in  Table  IV.  1  for  the  forward  recursion  require 
calculating  W,_, ,  ,  . ,  and  again,  exactly  as  was  done  on  the  backward  recursion.  These 
have  no  interesting  data  dependencies,  do  not  bound  the  computation,  and  are  computed 
exactly  the  same  on  the  forward  as  on  the  backward  recursion  —  thus  for  conciseness  they 
are  shown  only  once  in  Table  V.1. 

ueiay  conditions  tor  the  iogarithmic  case  proceea  as  snown  in  Mppenaix  C,  similarly  to 
Section  III. 

Since  it  is  possible  to  satisfy  the  minimum  delay  conditions,  propagation  occurs  at  a 
rate  of  (MV  -(-  VA)  per  node.  Assuming  that  the  delay  conditions  are  satisfied  and  that  all 
input  becomes  available  simultaneously,  it  can  readily  be  seen  that 

Avatl(Xa.b)  =  Avatl(Xa.a )  -f  flog2(6  -  o  +  1)1(A/V  -f  V A). 

Thus  in  particular,  if  X,  is  the  linear  recursive  variable  corresponding  to  Xa,h,  then  X,  =  Xo,, 
so 

Avail(Xt)  =  Avail(Xo.,) 

=  Avoxl(X„,a)  +  flog2(.  +  1)1(MV  +  VA). 

Assuming  a  maximally  parallel  implementation  as  before,  the  total  time  of  the  calculations 
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is  shown  in  Appendix  C  to  be 

2,log_,(n  -f  l)[(l  Mult  3  Addns)  -f-  5  Mults  -f  10  Addns. 

For  the  same  reasons  as  in  Section  III  this  is  slightly  lower  (by  2  M  .Its  2  Addns)  than  given 
in  [25],  due  to  differences  (not  requiring  additional  computation  by  the  host)  in  assumptions 
about  the  form  of  the  input  and  output  (not  in  the  computation). 


Table  V.1  —  Relative  Time  of  Logarithmic  Data  Dependencies 
Logarithmic  Recursive  Forms;  a  ^  b 


Logarithmic  Backward  Newton- Euler  Recursion 


Var. 


Waits  On{ 


Time  at  Step  End 


Step 


Costt 


Wn.b 


a  =  Avail(WJ%il) 


a  +  M  M 


w,.,  =  W„.vWA  +  ).(, 


MM 


a  —  Avail(W2 
b  —  Avail(ojj^) 


a  j4t>o»l(VV.,.,) 


max(a,  6)  H-  MV 
ma x(a,  b)  +  MV  +  V A 
rnax(a,c)  t  MV 


T,~W[+uiu>a.k 


MV 


T1)  ~f~  1 


v_a_ 

MV 


t  XIlV  =  both  of  Xa,k  and  Xk+i,b 
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Table  V.1  —  Relative  Time  of  Logarithmic  Data  Dependencies  (continued) 
Logarithmic  Recursive  Forms;  a  /  6 


Logarithmic  Backward  Newton  Euler  Recursion 

Var. 

Waits  On)  '  Time  at  Step  End 

Step 

^ostt 

P.,.- 

a  —  Aval' 

ma x(a  -f  MV,c  -+■  MV,d)  +  VC 

Tio  =  T2  X  /?*r  4-1.6 

VC 

b  —  Avail(ujj_y) 
c  =  Avail(ujs^) 

max(e,  /)  4-  V A 

7,,  -  5n-t-|.l  4 

VA 

max(e,  f)-t  V  A  +  SV 

7,,  =  27,, 

SV 

d  —  Avtul(R ,  , ) 
e  A\ail(S;.,.) 

/  -  Avr.il(Q  ) 

g  -  Avail(p .  ) 

i 

; 

max(a  -  MV  +  VC,  b  -  MV  - 1-  VC. 
d  r  VC,e  t  VA  4  SV', 
j  .  V  A  ■  SV)  +-  VA 

7,,  7,.  4  7,, 

VA 

. 

inax(a  -•  M  V  -+■  VC,  b  -4-  M  V  V C , 
d-r  VC,e  +  VA  +  SV, 
f  +  VA-+  SV)4-  VC  +  VA 

7m  =  7,  X  7,;, 

VC 

niax(a  MV  2 VC  +  V A, 
b  t  MV  -f-  2VC-F  VA, 
c  -t-  MV  ■+  VC,  d  +  2 VC  4-  VA, 
r  4  VC  4  2 VA  +  SV, 
f  -r  VC  ~  2 VA  4-  SV)  4-  VA 

7,  s  =  7,o  +  7m 

VA 

! 

max(a,  g)  4-  MV 

7.„  = 

MV 

max(a  +  MV  +  2 VC  4-  IV A, 
b  4-  MV  4-  2 VC  4-  2VA, 
c  4-  M  V  4-  VC  4-  VA,  d  4-  2 VC  +  2VA, 
e  4-  VC  4-  3VA  4-  SV, 
f  +  VC  +  3VA+  SV,g)  +  VA 

7,t  =  p*+i,(,  4-7,5 

VA 

max(a  4-  MV  4-  2VC  4-  3VA, 

6  4-  MV  4-  2VC  4-  3VA, 
c  4-  MV  4  VC  4  2VA,  d  4-  2 VC  4-  3VA, 
e  4-  VC  4-  4VA  f  5V, 
f  +  VC+4VA+  SV,g+  MV)  +  VA 

Pa.t.  =  ^16  +  Tl7 

VA 

t  See  end  of  continued  table. 

|  Xx,y  =  both  of  Xa,k  and  X*+i,6 
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Table  V.1  —  Relative  Time  of  Logarithmic  Data  Dependencies  (continued) 
Logarithmic  Recursive  Forms;  a  ^  b 


Logarithmic  Forward  Newton-Euler  Recursion 

Var 

'  Waits  OnJ 

Time  at  Step  End 

Step 

Cost!  | 

fa., 

h  =-■  Avait(fj, 

,)  |  h  r  MV 

r,,  = 

MV 

|  h  -r  MV  -r  VA 

'  fa.t-  —  /..»  -1-  T,g 

VA 

n  ,  , 

k  -  Avail(f ... 

„)  (  (already  computed  *) 

!  T2o  «  A}^,R,.i- 

( MV *) 

i  Aiail(n 

.)  '  h  ■  VC 

T<\  -  T.„  x  /, 

VC 

h  ■  .MV  -  VC 

V,,  .-IV,.,,.  ..,71, 

MV  : 

rnax(i,/i  -r  MV  -  \'C) 

-  VA 

r.,  --=  n,.k  •  7V.  ; 

1 

VA 

l 

■  i  -MV 

71.,  -  IV  ...  ...n.  . 

MV 

[  max(t,  h  -r-  VC  -f-  V A) 

n„j ■  =  7V,  71.) 

VA 

L 

-i-MV-e-VA 

"  (that  is,  computable  before  the  recursion  reaches  the  node,  except  at  the  initial  node) 

t  A',.,.  both  of  A',,.,  and  Xk 


*  VA  time  cost  of  Vector  Addition 
VC  — time  cost  of  Vector  Cross  product 
SV  tune  cost  of  Scalar  multiplication  of  a  Vector 
M  V  — time  cost  o ]  Matrix  multiplication  of  a  Vector 
MM  =- time  cost  of  Matrix  Multiplication 
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6  IMPLEMENTATION  CONSIDERATIONS 


As  noted  earlier,  the  main  thrust  of  this  paper  has  been  an  exploration  of  the  extent  to  which 
parallelism  can  be  pushed  in  this  particular  domain.  Nonetheless,  we  hope  to  indicate  that 
the  physical  requirements  are  not  particularly  excessive  and  thus  this  might  be  a  reasonable 
thing  to  actually  implement.  This  section  contains  a  fairly  general  discussion  of  hardware 
requirements  Chip  architecture  is  discussed  in  the  next  section. 

Except  perhaps  for  some  variation  in  the  constant  term,  the  hardware  sketched  in 
this  section  is  intended  to  capture  the  maximally  parallel  nature  of  the  algorithms  and  to 
physically  attain  the  stated  time  bounds  It  is  always  possible  to  deliberately  sacrifice  speed 
and  gam  material  economy  by  allowing  hardware  multiplexing  in  a  partially  parallel  system. 

The  description  of  the  algorithms,  the  data  dependencies,  and  the  timing,  have  been 
developed  quite  generally  in  terms  of  matrix  and  vector  operations.  One  can  easily  imagine 
many  different  ways  of  implementing  matrix  arithmetic,  however.  The  purpose  of  this  section 
will  be  to  imagine  these  in  sufficient  detail  to  conclude  that  the  algorithms  proposed  are 
reasonable  We  also  imagine  that  any  actual  physical  implementation  will  differ  in  many 
particular  details  from  those  presented  here  That  is.  we  expect  that  the  details  of  the 
arrangements  and  requirements  which  we  display  will  change,  but  not  by  so  much  as  f't 

icnuei  iiie  yeueiui  uui illusions  inapplicable. 

Much  that  we  will  observe,  however,  can  be  seen  as  a  consequence  of  the  macro¬ 
scopic  computational  structure  of  the  algorithm  itself.  Where  possible  we  will  conduct  the 
discussion  at  a  level  of  matrix-vector  arithmetic  whose  operators  are  expressed  in  differing 
implementations,  and  we  will  seek  to  discover  regularities  which  constrain  the  design  across 
different  implementations. 

We  will  generally  worry  about  three  main  concerns  which  commonly  dog  parallel  systems: 

(a)  internal  buffering  and  storage  of  intermediate  results, 

(b)  communication  and  bussing  of  intermediate  results,  and 

(c)  number  of  processors. 

We  will  propose  a  general  computational  structure,  then  use  that  to  consider  requirements 


for  internal  buffering  and  storage.  This  will  lead  us  to  the  communication  and  bussing 
necessary  to  get  the  data  to  the  appropriate  storage  locations.  Finally  we  consider  the 
processing  required. 

We  accept  several  intuitions  commonly  held  in  VLSI,  including:  as  dimensions  scale 
downward  complexity  scales  upward,  which  leads  us  to  seek  simplicity  and  regularity;  and 
as  dimensions  scale  downward  processing  power  becomes  cheap  and  communication, 
bussing,  and  buffering  become  the  limiting  factors.  We  are  for  now  willing  to  assume 
global  communication  (in  particular,  a  global  clock  and  a  global  reset  signal)  This  avoids 
many  problems  of  timing  and  handshaking  protocol.  We  further  assume  that  timewise, 
1  Mult  =  1  Add*  =  1  Flop  This  substantially  simplifies  timing,  and  would  not  be  an 
unreasonable  simplification  to  impose  on  a  (synchronous)  physical  implementation  anyway. 
In  the  discussion  of  buffering  in  particular,  we  will  use  this  metric  (together  with  the  relative 
offsets  and  more  general  timing  developed  in  Sections  III  and  V)  to  establish  the  cases  in 
which  buffering  is  or  is  not  required. 

Note  that  the  logarithmic  formalism  as  developed  in  the  text  above  required  a  separate 
handling  of  the  a  —  6  case.  This  would  certainly  be  the  way  one  would  build  it  from 
multipliers  and  adders  in  order  to  achieve  the  tightest  possible  time  bound.  However,  if  the 
most  likely  implementation  of  the  logarithmic  algorithm  will  be  in  VLSI  or  WSI  it  is  here  more 
interesting  to  devise  a  single  regular,  systematic  structure  which  uniformly  handles  both  the 
a  —  b  and  the  a  ^  b  cases.  Since  a  =  b  only  happens  once  each  direction  we  are  quite 
happy  to  purchase  simplicity  and  regularity  at  the  price  of  an  increase  in  the  constant  term. 
Appendix  D  presents  a  technical  artifice  by  which  we  can  make  the  a  -=  b  case  look  like 
a  ^  b. 

The  conceptual  structure  of  the  computation  was  shown  in  Figure  ill.  1  (for  the  linear 
case)  and  Figure  IV.2  (for  the  logarithmic  case,  with  only  the  backward  recursion  shown). 
The  processor  nodes  could  be  hooked  up  in  a  regular,  regularly  extensible  structure  very 
similar  to  Figures  III.  1  and  IV.2.  This  would  permit  global  pipelining  of  successive  sets  of  joint 
torques  at  intervals  of  4  Flops.  However,  note  that  only  one  node  (linear)  or  tier  (logarithmic) 
is  ever  active  at  any  given  point  in  the  computation.  We  could  therefore  map  our  full 
conceptual  structure  into  a  much  smaller  one,  which  buffers  intermediate  results  in  such  a 
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way  as  to  use  only  one  physical  processor  node  (linear)  or  tier  of  nodes  (logarithmic).  This 
would  use  much  less  hardware,  but  would  imply  that  computation  of  one  set  of  joint  torques 
must  complete  before  the  next  could  begin.  The  basic  notion  (which  will  be  expanded 
more  fully  in  the  discussion  below)  is  shown  in  Figure  VI.  i  a,  for  conceptual  orientation.  The 
backward  and  forward  recursion  nodes  are  never  simultaneously  active,  so  processors  could 
be  shared  between  them;  see  Figure  VI. 1b.  The  choice  of  architecture  will  depend  heavily 
on  whether  it  is  desired  to  systolically  pipeline  successive  sets  of  joints  torques  at  4  Flop 
intervals,  or  simply  to  calculate  one  set. 

BUFFERING 

As  shown  in  Figure  VI.  1b,  there  are  four  sets  of  buffers  and  communication  pathways  with 
which  we  will  concern  ourselves;  input  buffering,  intra-node  buffering,  inter-node  buffering, 
and  backward-forward  recursion  buffering.  These  may  be  roughly  divided  into  internal  and 
external  buffers.  From  input/output  considerations  we  can  put  immediate  bounds  on  the 
amount  of  buffering  external  to  a  node  which  may  be  required  (this  is  input,  inter-node, 
and  backward  forward  recursion  buffering).  There  are.  after  all.  only  so  many  variables 
to  transmit.  The  potential  problem  lies  in  the  internal  buffering  within  any  one  processor 
node,  for  we  have  created  a  whole  host  of  intermediate  partial  computations  which  have 
complicated,  intertwined  data  dependencies.  The  potentially  massive  buffering  required  for 
these  is  actually  very  tightly  bounded. 

Input  buffering  requires  storing  the  scalar  values  for  and  A,,  as  well  as 

the  manipulator  configuration  parameters  (p*.  pf ,  r*.  s*,  m,,  and  J ,)  and  the  endpoint 

values  (wo,  w0.  p0,  /„+i  and  n„+i),  for  a  total  of  about  37 n  -f  15  scalar  values.  Buffering 
within  any  one  node  in  the  linear  case  requires  at  most  15  scalar  values  of  temporary  storage 
since  (referring  to  Table  111.1 )  only  w,,  Ta,  Tn,  Ti;,  and  T1t  are  not  used  immediately,  but  this 
requirement  can  be  met  by  simply  latching  the  output  of  the  sub  processors  into  registers 
(as  in  the  next  section).  This  requirement  is  only  3  scalars  in  the  logarithmic  case  (for  T\), 
and  it  can  also  be  met  by  latching.  Buffering  between  nodes  is  not  needed  in  the  linear 
case  and  requires  about  54  scalars  per  node  in  the  logarithmic  case  (if  outputs  are  latched, 
this  can  nearly  be  halved)  Backward-forward  recursion  buffering  involves  at  most  only  the 
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passed  recursion  variables  from  each  joint.  (Additional  intermediate  storage  is  needed  if 
a  systolic  pipelined  architecture  is  implemented,  by  about  a  factor  of  n,  linear,  or  [log^n], 
logarithmic.)  Buffering  requirements  are  considered  in  more  detail  in  [25]. 

COMMUNICATION 

We  do  not  wish  to  propose  a  detailed  communication  protocol,  only  to  show  that  the 
requirements  are  sufficiently  bounded  that  such  a  protocol  could  be  devised.  To  develop 
this,  we  shall  discuss  communication  in  terms  of  “wires"  which  are  vector  busses  three 
scalars  wide,  and  not  allow  multiplexing  of  busses.  In  a  real  implementation,  data  on 
the  '  vector  busses  three  scalars  wide'4  might  be  transmitted  serially  as  three  sequential 
numbers  over  only  a  single  real  wire.  Also  in  a  real  implementation,  the  busses  might 
be  time  multiplexed.  Thus  in  the  discussion  below,  the  reader  should  supply  her  or  his 
own  scale  factor  depending  on  her  or  his  own  implementation  image.  After  developing 
communication  models  for  Figures  VI.  1b,  we  will  show  how  the  systolic  pipelined  structures 
of  Figures  III.  1  and  IV. 2  have  a  regular,  regularly  extensible  communication  network. 

As  in  the  question  of  buffering,  communication  falls  into  that  internal  and  external  to  a 
node.  The  internal  communications  may  be  treated  as  comprising  a  fixed  size  hard  wired 
module  in  which  the  '  wires’4  are  laid  down  as  dictated  by  the  algorithm.  The  external 
communication  must  carry  exiernaiiy  buffered  data  back  and  forth  between  the  process 
nodes  and  the  buffers. 

Bounds  may  be  placed  on  the  internal  communication  by  noting  the  following: 

(a)  as  shown  below  (when  number  of  processors  is  considered),  none 
of  the  linear  forward  or  backward  recursion,  the  logarithmic  forward  or 
backward  recursion,  or  the  non-propagated,  variables  have  more  than  25 
matrix-vector  operators, 

(b)  no  operator  has  more  than  two  operands,  and  therefore 

(c)  no  single  module  of  the  above  (a)  has  more  than  50  point-to-point 
busses. 

If  the  forward  and  backward  recursion  share  processors  then  this  must  be  increased  to  a 
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maximum  of  35  operators,  hence  70  busses.  The  non  propagated  variables  have  no  more 
than  10  operators,  hence  20  busses,  and  these  must  be  included  somewhere  as  noted 
above.  We  stress  that  this  is  an  upper  bound  based  only  on  the  total  number  of  processors, 
and  in  any  real  implementation  the  actual  number  would  doubtless  be  much  smaller.  This  is 
because  many  intermediate  results  are  used  by  only  one  operator,  so  we  expect  that  many 
operands  could  be  transmitted  by  abutment,  by  simply  connecting  one  operator’s  output 
directly  to  the  input  of  the  next  operator.  This  will,  however,  depend  on  the  particular 
implementation  layout  geometry  chosen.  The  essential  point  is  that  there  are  easily  few 
enough  wires  to  route  on  a  point-to-point  basis. 

For  external  communication  we  consider  the  paths  to  and  from  the  buffers  separately. 
We  essentially  argue  that  since  the  external  buffering  requirements  were  not  excessive,  the 
communication  required  to  support  those  requirements  will  not  be  excessive.  No  inter  node 
buffering  was  required  in  the  linear  case,  but  the  variables  themselves  must  be  transmitted 
—  this  requires  four  vectors  on  the  backward  recursion  and  two  on  the  forward.  The 
logarithmic  case  required  buffering  at  most  18  vectors  (54  scalars)  per  node  per  (MV  -f  V A) 
cycle  (both  backward  and  forward  recursion),  hence  at  most  18  vector  busses.  The  input 
values  (18  scalars  per  node)  and  the  endpoint  values  (wn.  etc.  —  12  scalars  at  the  first 
node  of  the  backward  recursion  and  6  scalars  at  the  first  node  of  the  forward)  must  be 
commumcatea.  Finally,  eacn  node  must  eventually  communicate  t,  and  N,  to  tne  backward- 
forward  recursion  buffers. 

The  systolic  pipelined  structure  of  Figures  111.1  and  IV.2  has  particularly  nice  scaling 
properties  as  n  increases.  These  structures  do  not  fold  the  nodes  or  tiers  together  as  does 
Figure  VI. la  or  b.  In  the  logarithmic  case,  the  reader  should  compare  Figure  IV.2  with 
Figures  VI. 2  and  VI. 3.  All  show  almost  identical  structure,  but  in  different  fashion.  Figure 
VI. 2  shows  Figure  IV.2  expanded  to  include  the  forward  recursion,  and  Figure  VI.3  maps  this 
into  a  regular  rectilinear  array  This  array  is  also  regularly  extensible  as  shown  in  Figure 
IV  4.  In  all  cases,  circles  are  processor  nodes  which  implement  one  complete  node  of  the 
logarithmic  algorithm.  Additionally,  in  Figure  VI.3  squares  represent  passive  buffers  which 
do  no  more  than  perform  buffering  for  the  variables  transmitted  from  node  to  node,  or  from 
backward  to  forward  recursion.  In  the  linear  case  the  regular,  regularly  extensible  array 
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structure  can  be  clearly  seen  from  Figure  111.1.  Passive  buffering  is  only  required  between 
backward  and  forward  recursion  processors.  (We  assume  that  the  non-propagated  variables 
are  computed  from  the  propagated  variables  within  the  circles  as  appropriate,  in  the  manner 
discussed  above).  All  variables  pertaining  to  the  same  time  slice  are  thus  calculated  in  the 
same  node  (linear)  or  tier  (logarithmic),  and  progress  node  by  node  (tier  by  tier)  untH  they 
finally  emerge  as  the  desired  torques.  Each  node  or  tier  requires  ( MV  -f  V A )  =  4  Flops  to 
complete,  so  if  different  successive  input  values  were  presented  at  intervals  of  4  Flops  ai  the 
bottom,  corresponding  motor  torques  would  emerge  from  the  top  at  intervals  of  4  Flops.  It 
seems  likely  that  the  speed  would  be  bounded  by  the  input/output  requirements  of  the  host 
system 

The  thing  to  notice  about  the  linear  case  in  Figure  111.1  is  that  it  is  possible  to  restrict  the 
total  width  of  the  busses  between  successive  nodes  to  the  width  required  for  the  inter- node 
communication  of  any  single  node,  and  as  argued  above  this  is  finite  and  small.  In  the 
logarithmic  case  (Figure  VI.3).  the  maximum  total  width  of  the  busses  may  be  restricted  to 
twice  this.  Thus  as  the  structure  is  scaled  upward  (i.e.  as  n  becomes  arbitrarily  large),  the 
amount  of  area  consumed  by  busses  remains  a  constant  fraction  of  the  total  area. 

NUMBER  OF  PROCESSORS 

Consider  next  the  number  of  processors  required.  Tables  111.1  and  V.1  were  constructed 
to  represent  exactly  each  operation  at  a  node  exactly  once  (except  that  as  noted  in  Section 
V.  Wj.y  and  RltV  are  calculated  in  the  same  way  on  both  the  logarithmic  backward  and 
forward  recursions,  so  for  conciseness  were  shown  only  on  the  backward  —  here  they  must 
be  counted  in  the  forward  recursion). 

Since  the  time  denotation  given  in  Tables  111.1  and  V.1  (MV,  VA,  VC,  etc.)  also 
denotes  the  operation,  we  see  that  to  propagate  the  recursive  variables  through  a  single 
node  requires 


Requirements  Per  Node  —  Propagated  Variables  Only 


Algorithm 

MV 

VA 

VC 

SV 

MM 

Total 

Linear  Bkwd.  N.E. 

. . . 

3 

5 

4 

12 

Linear  Fwd.  N.E. 

2 

4 

2 

8 

Logarithmic  Bkwd.  N.E. 

6 

12 

4 

1 

1 

24 

Logarithmic  Fwd.  N.E. 

5 

4 

1 

1 

11 

Additionally,  one  must  also  compute  the  non-propagated  variables  (r, ,  F,,  N,,  and  r,  from 
Table  III.  1 ).  which  is  the  same  for  both  the  linear  and  the  logarithmic  cases. 


Requirements  Per  Node  —  Non- Propagated  Variables  Only 

Algorithm 

MV 

VA 

VC 

SV 

MM 

Total 

Backward  Recursion 

2 

3 

4 

1 

10 

Forward  Recursion 

0 

The  total  per  node  requirements  are  thus 


Requirements  Per  Node 


Algorithm 

MV 

Total 

Linear  Bkwd.  N.E. 

5 

8 

8 

1 

22 

Linear  Fwd.  N.E. 

2 

4 

CM 

J 

8 

Logarithmic  Bkwd.  N.E. 

- 1 

8 

15 

8 

2 

1 

34 

Logarithmic  Fwd.  N.E. 

5 

4 

1 

1 

11 

—  Totals 


An  interesting  optimization  is  possible  in  the  logarithmic  case  if  it  is  not  required  to 
model  the  rotation  of  the  earth  and  if  n  is  even.  Recall  from  Section  VI  that  the  recursion 
was  grounded  in  0(_ ,j,  and  that  we  considered  2I_1J  as  a  vector  pointing  through  the  North 
Pole,  etc.  This  was  primarily  done  to  provide  a  firm  and  unambiguous  grounding  for  the 
mathematical  analysis,  and  in  many  cases  may  not  be  required.  The  logarithmic  algorithm 
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requires  one  processor  node  for  each  pair  (or  fraction  thereof)  of  input  nodes.  If  n  is 
odd  then  [(n/2)]  =  f(n  +  i)/2|  and  the  number  of  processor  nodes  is  (n  -f  i)/2  regardless. 
However,  if  »  is  even  then  [(n/2)]  =  [(n  -p  i)/2]  —  1,  and  by  grounding  the  recursion  in  0Q 
we  may  eliminate  one  processor  node.  Except  perhaps  in  satellite  applications,  typically  this 
loses  nothing  interesting  anyway. 

An  implementation  might  involve  a  special-purpose  VLSI  chip  capable  of  handling  general 
vector  arithmetic  up  to  and  including  matrix-vector  multiplication.  (This  is  chosen  to  be 
midway  between  two  alternatives  —  a  matrix-matrix  chip  would  be  larger  and  could  be 
implemented  using  three  of  the  proposed  chips,  while  a  vector-vector  chip  would  be  smaller 
but  three  could  implement  the  chip  proposed.)  The  chip  would  incorporate  18  registers  (for 
storing  three  pairs  of  3-vectors),  6  multipliers  and  3  adders.  It  would  also  need  a  means  of 
sequencing  these  operations,  as  well  as  a  means  of  decoding  (perhaps  from  jumpers  wired 
to  the  pins)  which  particular  vector  operation  to  perform. 

A  natural  candidate  is  a  datapath  chip,  as  in  Barrett  et  al.[3].  As  shown  in  Figure  VI. 5,  it 
consists  of  a  Programmable  Logic  Array  (PLA)  which  decodes  micro-instructions  to  produce 
control  signals  driving  computational  and  storage  elements  attached  to  several  common  data 
busses.  By  controlling  when  the  storage  elements  read  and  write  which  busses,  and  when 
and  from  which  busses  the  computational  elements  obtain  their  operands  and  output  their 
re?nitc  natijro  nf  tho  comp,jtstior*  p^rfoTPsd  on  chip  is  ccntrc!!sd.  Thus  3  chip 

functions  as  an  easily-customizable  micro-CPU.  This  will  be  explored  more  fully  in  the  next 
section. 

The  reader  can  gain  some  intuitive  feel  for  an  implementation  by  turning  to  Table  111.2, 
and  imagining  that  the  pages  represent  printed-circuit  boards  and  the  indicated  operations 
represent  chips.  The  structure  so  composed  would  implement  the  fully  systolic  pipelined 
linear  architecture  for  n  =  4.  As  suggested  in  the  next  section,  several  such  Vector 
Arithmetic  Modular  Processors  (VAMPs)  might  be  put  on  a  single  package,  depending  on 
fabrication  and  processing  technology,  so  the  actual  chip  count  may  be  lower.  To  avoid 
these  dependencies,  the  discussion  following  will  be  conducted  in  terms  of  VAMPs,  each 
capable  of  computing  one  vector  result,  and  three  together  of  computing  a  matrix-matrix 
multiplication. 
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Below.  "Single  Device"  (single  node,  linear  case,  or  tier,  logarithmic)  is  a  non-systolic 
configuration.  "Systolic  Pipeline"  in  the  logarithmic  case  requires  only  seven  nodes  total  on 
each  of  the  forward  and  backward  recursions,  in  light  of  the  optimization  discussed  above 
for  n  =  6  applied  to  Figure  IV. 2,  plus  six  more  if  the  a  —  b  case  is  handled  as  in  Appendix 
D.  The  algorithms  would  require  the  following  number  of  VAMPs 


VAMPs*  —  Propagated  Variables  Only  (n  =  6) 

Algorithm  '  Per  Node 

Tier,  n  —  6 

Single  Device  !  Systolic  Pipeline 

Linear  Bkwd.  j  12 

* 

12 

72 

Linear  Fwd.  J  8 

• 

8 

48 

Logarithmic  Bkwd. 

26  j  78 

78 

338 

Logarithmic  Fwd 

13  j  39 

39 

169^ 

*  Vector  Arithmetic  Modular  Processors  —  see  also  next  section 


VAMPs  —  Non-Propagated  Variables  Only  (n  =  6) 


i 

[Algorithm 
j  Linear  Bkwd. 

Per  Node 

-  - 

Tier,  n  —  6  J 

* 

Single  Device  !  Systolic  Pipeline  1 
10  j  60  ~ | 

j  Linear  Fwd.  j  0 

1 

* 

o 

0 

|  Logarithmic  Bkwd. 

10 

20 

1 _ “ 

60 

|  Logarithmic  Fwd. 

0 

0 

0 

0 

The  total  requirements  are  thus 


— 

VAMPs  —  Totals,  n  =  6 

Algorithm 

Per  Node 

Tier,  n  =  6 

Single  Device 

Systolic  Pipeline 

Linear  Bkwd. 

22 

_  . 

* 

. 22  _ 

132 

Linear  Fwd. 

8 

* 

8 

48 

Logarithmic  Bkwd, 

36 

98 

98 

398 

Logarithmic  Fwd. 

13 

39 

39 

169 
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Figure  VI. la  —  Non-Systolic,  Processors  Shared,  All  Joints 


Output 

Torques 


Output 

Torques 


Figure  VI. 1b  —  Non-Systolic,  Forward  and  Backward  Recursion  Processors  Also  Shared 


Tt*  *  - 


4 


Figure  VI.4  —  Regularly  Extensible  Logarithmic  Recursive  Graph 


7.  A  ROBOT  CHIP 


Having  seen  that  the  global  physical  requirements  are  not  excessive,  it  becomes  interesting  to 
look  more  closely  at  VLSI  chip  structures  capable  of  physically  implementing  the  computation. 
It  is  clear  from  the  discussion  above  that  we  require  a  chip  capable  of  performing  general 
matrix-vector  arithmetic.  Within  this  section  we  investigate  the  computation  and  control 
architecture  supporting  this  requirement.  We  will  sketch  one  architecture  suitable  for  our 
purposes,  though  of  course  there  are  others.  As  in  the  previous  section  we  imagine  that 
any  actual  physical  implementation  will  differ  in  many  particular  details  from  those  presented 
here,  but  not  by  so  much  as  to  render  the  general  conclusions  inapplicable. 

Obviously  such  a  chip  will  be  more  broadly  applicable  than  just  the  inverse  dynamics 
computation,  however,  and  we  will  seek  to  maintain  a  high  degree  of  flexibility  in  our 
implementation.  In  particular,  we  will  seek  to  insure  that: 

(a)  all  chips  used  be  identical,  or  at  least  interchangeable,  so  that 
only  one  chip-type  is  required, 

(b)  the  particular  operation  which  any  chip  performs  be  programmable, 

(c)  the  length  of  the  basic  time  cycle,  and  the  points  during  the  cycle 
when  operands  are  read  or  the  result  written  to  busses,  and  which  busses, 
be  programmable, 

(d)  the  host  computer  be  able  to  dynamically  reprogram  (b)  and  (c) 
at  will  by  simply  writing  the  appropriate  values  to  the  n  input  buffers,  in 
exactly  the  way  data  is  written  to  the  device,  and 

(e)  the  control  flexibility  of  (a)  through  (d)  must  not  use  any  com¬ 
munication  wires  or  buffering  other  than  those  already  required  to  support 
the  computation  and  discussed  in  the  previous  section. 

Later  in  the  section  we  will  consider  modifications  to  the  bussing  scheme  discussed  in  the 
preceding  section.  By  relaxing  (e),  we  will  sketch  an  architecture  whereby 

(f)  the  sources  of  the  operands  which  any  particular  chip  accepts  are 
programmable  (as  in  (d)),  and 
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(g)  a  limited  error-correcting  capability  may  be  incorporated,  so 
that  correct  computation  can  often  be  automatically  continued  following 
individual  chip  failures. 

It  should  be  explicitly  observed,  however,  that  the  architecture  presented  is  incapable  of 
matrix  inversion  or  similar  operations  (e  g.,  Gaussian  elimination).  It  would  also  be  useful 
to  incorporate  some  of  the  common  trigonometric  functions  and  inverses,  as  well  as  square 
root.  These  capabilities  are  required  for  many  interesting  applications,  but  require  much 
more  complex  control  and  condition  testing  than  we  need  for  basic  matrix-vector  arithmetic. 

The  basic  unit  of  most  matrix  vector  arithmetic  is  the  vector  dot  (inner)  product.  Matrix- 
vector  multiplication  is  accomplished  by  forming  the  dot  product  of  each  of  the  rows  of  the 
matrix  with  the  vector,  and  matrix-matrix  multiplication  by  forming  the  dot  product  of  each 
row  of  the  first  matrix  with  each  column  of  the  second.  Other  familiar  operations  (vector 
cross  (outer)  product,  vector  addition,  etc.)  may  be  performed  with  different  sequencing  of 
the  basic  dot  product  hardware. 

Our  strategy  will  be  to  compose  a  primitive  module  capable  of  computing  one  coordinate 
of  the  result  (i.e.,  one  vector  dot  product).  If  we  then  group  several  such  primitive 
modules  together,  along  with  suitable  control,  we  can  implement  all  necessary  matrix  vector 
operations.  The  number  of  primitive  modules  per  chip  is  a  design  decision;  the  control 
mechanisms  we  develop  support  different  choices.  Nine  primitive  modules  are  sufficient 
for  a  matrix-matrix  multiplication,  and  three  for  a  matrix-vector  multiplication.  The  basic 
architecture  established,  we  display  control  sequences  implementing  the  various  operations. 
Next  we  embed  the  mechanism  in  a  timing  structure  governing  when  in  the  cycle  operands 
are  read  and  the  result  computed  and  written.  The  internal  structure  explicated,  we  discuss 
how  the  control  information  might  be  loaded  under  programmed  control  from  the  host. 
Finally,  we  indicate  how  the  bus  implementation  of  chip  interconnect  might  be  extended 
to  allow  different  algorithms  to  be  dynamically  programmed,  and  a  limited  error- correcting 
capability. 

As  shown  in  Figure  VI.5,  a  datapath[3]  is  a  bus-oriented  architecture  composed  of 
stacked  computational  elements,  busses  running  through  them,  and  a  centralized  control. 
Typically  a  primitive  cell,  one  bit  wide,  is  replicated  in  the  horizontal  direction  to  the 


width  (in  bits)  of  the  datapath,  yielding  one  computational  element  (such  as  a  register 
or  adder).  For  floating-point  arithmetic  this  is  modified  slightly  so  that  the  mantissa  and 
the  exponent  circuitry  are  replicated  separately.  Different  computational  elements  are  then 
stacked  vertically  to  form  the  datapath.  One  or  more  busses  run  vertically  through  each  cell, 
and  these  may  be  connected  by  abutment  to  vertically  adjacent  elements  forming  vertical 
global  busses.  Control  lines  run  horizontally  through  each  cell,  control  being  frequently 
generated  by  a  Programmable  Logic  Array  (PLA)  on  the  side. 

For  datapath  elements  we  assume  the  following:  registers,  adders,  and  multipliers. 
(Other  elements,  such  as  comparators  and  dividers,  would  also  be  needed  to  implement 
matrix  inversion  or  Gaussian  elimination.)  Each  element  "talks"  to  both  busses  and  to  both 
vertically  adjacent  (abutting)  neighbors.  Thus  each  element  can  load  operands  and  dump 
results  to  and  from  the  busses  and  adjacent  neighbors  Additionally,  in  order  to  facilitate 
serial  communication  between  chips  and  minimize  wires,  we  assume  that  the  registers  are 
shift  registers,  and  can  shift  operands  in  or  results  out.  We  observed  while  developing  upper 
bounds  in  the  preceding  section  that  the  "vector  busses  three  scalars  wide”  might  actually 
be  implemented  as  a  single  multiplexed  wire  in  order  to  minimize  wire-count.  “Operand” 
below  will  typically  mean  the  three  scalar  values  which  make  up  a  vector,  unless  context 
clearly  indicates  otherwise.  (All  data  transfers  wiihm  a  module  are  done  in  parallel,  of 

rrinrco  \ 

Thus  the  basic  cycle  will  be: 

(a)  start  cycle, 

(b)  load  first  operand  from  off-module,  a  certain  delay  after  the  cycle 
starts, 

(c)  load  second  operand  from  off-module,  a  second  certain  delay  after 
cycle-start, 

(d)  compute  result, 

(e)  dump  result  to  off-module,  another  certain  delay  after  cycle-start, 

(f)  delay  until  cycle  ends,  another  certain  delay  after  cycle-start,  and 

(g)  go  to  (a). 

As  noted  above,  the  basic  cycle  length  of  the  inverse  dynamics  algorithm  considered  herein 
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is  MV  -f  VA  ~  4  Flops,  though  other  algorithms  will  have  other  basic  cycles. 

Consider  the  floor-plan  for  a  primitive  module,  shown  as  a  block  diagram  in  Figure  VII. 1. 
Assuming  that  the  operands  have  been  properly  loaded,  it  is  easy  to  see  that  it  can  compute 
one  coordinate  of  any  of  the  matrix-vector  operations  listed  at  the  end  of  Tables  111.1  and  V.l. 
(Note  that  to  enable  the  vector  cross  product  (VC),  the  two  multipliers  are  placed  between 
the  two  register  pairs  NOT  corresponding  to  the  result  component  calculated  by  the  primitive 
module.)  The  sequences  of  data  transfers  and  computations  required  for  each  are  given 
in  Table  VII. 1.  Only  two  global  busses  are  required,  and  that  the  number  of  multipliers  has 
been  reduced  from  three  to  two  (a  consequence  of  the  1  Mult  =  1  Addn  =  1  Flop  timing 
assumption,  as  noted  in  the  previous  section). 

Several  of  these  primitive  modules  will  operate  concurrently  to  produce  a  matrix  or 
vector  result;  since  our  vectors  are  in  Z:<,  this  number  will  normally  be  a  multiple  of  three. 
As  in  the  previous  section,  we  will  assume  groups  of  three  primitive  modules.  Figure 
VII. 2  shows  a  floor  plan  block  diagram,  together  with  control  and  (multiplexed)  off-module 
communication  (not  shown  is  the  ROM  or  RAM  program  storage).  Such  a  device  would 
be  capable  of  one  matrix-vector  multiplication,  or  the  three  coordinates  of  any  supported 
vector  vector  operation.  For  purposes  of  discussion,  we  will  take  this  Vector  Arithmetic 
Modular  Processor  (VAMP)  as  our  unit  of  computation. 

The  VAMP  operand  inputs  are  wired  (perhaps  through  intermediate  delay  buffers  as 
discussed  in  the  preceding  section)  to  the  result  output  of  the  VAMP  computing  that 
operand.  By  synchronizing  between  source  and  destination  VAMPs  the  respective  dump 
and  load  delays  after  cycle-start  (recall  that  we  are  willing  to  assume  a  global  clock  and 
reset),  intermediate  results  can  be  passed  to  successive  operators. 

Serial  off-module  communication  between  VAMPs  might  permit  several  VAMPs  to  share 
the  same  package  by  substantially  reducing  pin-out.  (The  trade-off  is  that  it  will  increase 
communication  time,  perhaps  requiring  the  insertion  of  additional  delays.)  Though  the 
physical  size  is  heavily  dependent  on  choice  of  process  technology,  width  of  the  data  word 
in  bits,  and  other  design  parameters,  we  can  make  some  initial  estimates  of  total  package 
count  based  solely  on  pin-out  (with  the  understanding  that  these  will  likely  be  revised 
upwards  depending  on  design  parameters  and  the  particular  fabrication  process).  In  most 


process  technologies,  each  VAMP  would  require  external  connections  to  Power,  Ground, 
Clock- 1,  Clock-2,  Global-Reset,  Operand-A,  Operand-B,  and  Result.  I(  each  VAMP  is 
wired  separately,  a  nominal  40- pin  package  would  provide  pin- out  sufficient  for  five  VAMPs 
to  support  these  eight  signals.  (Due  to  process  defects,  of  course,  they  must  be  individually 
tested  and  faulty  VAMPs  discarded.)  If  a  single  Clock  input  is  converted  to  Clock- 1  and 
Clock-2  (or  more  if  needed,  e  g  Pre-Charge)  on-chip  and  distributed  to  each  VAMP,  and 
if  Power,  Ground,  and  Global-Reset  are  also  brought  in  on  one  pin  each  and  distributed, 
then  a  nominal  40-pin  package  would  provide  pin  out  sufficient  for  i_  VAMPs.  As  was  seen 
in  the  preceding  section,  11  such  40- pin  packages  would  implement  the  full  systolic  pipeline 
for  the  linear  case  of  n  —  6. 

Alternatively  the  other  extreme  can  be  chosen,  and  each  VAMP  packaged  individually. 
Each  VAMP  could  be  put  in  an  8  pin  package,  resulting  in  very  small  cheap  chips.  This 
would  have  the  further  advantage  of  not  requiring  innovative  packaging. 

The  control  must  specify  the  delays  associated  with  communication  and  computation 
of  the  result,  and  which  operation  to  perform.  This  is  shown  schematically  in  Figure  VII.3. 
The  delays  are  easily  implemented  by  counters  and  comparators.  A  Cycle-Counter  is  zeroed 
at  each  cycle  start  and  incremented  at  each  Flop.  The  delay  associated  with  each  register 
load/dump  is  compared  to  the  Cycle  Counter  and  the  register  loaded  on  equality.  By 
insuring  mat  cycie  sian  occurs  syncnronousiy  on  an  cmps,  communication  timing  between 
operand  sources  and  destinations  can  be  coordinated  programmatically.  In  turn,  in  virtue  of 
accepting  a  global  clock  signal  we  can  insure  simultaneous  global  cycle-start  (e  g.,  perhaps 
by  bringing  Global-Reset  to  True  for  one  clock  period  only  at  the  beginning  of  each  cycle). 
This  globally  resets  counters  to  zero,  and  insures  that  accidentally  dropping  a  bit  or  missing 
a  clock  tick  does  not  throw  a  module  permanently  out  of  step. 

It  is  sufficient  if  the  device  is  able  to  execute  operations  from  only  a  small  fixed  repertoire, 
viz.  those  at  the  end  of  Tables  111.1  and  V.l.  This  being  the  case,  the  microcode  instructions 
needed  for  each  item  in  the  repertoire  may  be  stored  on- module  in  Read-Only-Memory 
(ROM).  The  operation  to  be  performed  can  then  be  specified  as  a  pointer  into  ROM,  e.g. 
as  the  high-order  address  bits  or  as  an  index  pointer. 

If  we  were  designing  a  very  general  machine  we  would  certainly  want  the  microcode 


73 


to  be  loaded  under  program  control  from  the  host.  Though  such  capability  is  not  needed 
for  most  matrix-vector  algorithms,  we  observe  that  Random  Access  Memory  (RAM)  could 
be  loaded  in  much  the  same  way  as  we  will  load  control  information.  The  pointer  into 
ROM  would  then  be  augmented  with  an  indicator  bit  selecting  RAM  or  ROM,  and  all  would 
proceed  as  sketched  for  the  ROM  only  case. 

Thus  we  see  that  communication  and  control  can  be  effected  by  loading  three  registers 
with  delays  for  the  comparators  and  one  register  with  a  ROM  pointer.  These  four  values  will 
be  referred  to  below  as  the  module's  “program".  Next,  consider  how  these  control  registers 
might  be  loaded  under  programmed  control  from  the  host.  (For  a  prototype  version  one 
might  simply  bring  the  control  in  from  off-chip,  each  control  register  bit  corresponding  to  a 
pin  wired  by  lumpers  to  either  power  or  ground.) 

Since  the  same  wires  are  to  be  used  for  both  data  and  control  the  chips  must  have 
a  way  to  distinguish  which  is  which.  Though  it  would  be  possible  to  tag  each  word  with  a 
control  bit.  it  is  preferable  to  use  the  Global-Reset  signal.  In  normal  operation  (Data  Mode) 
Global-Reset  is  (for  example)  brought  True  for  exactly  one  clock  tick  to  mark  cycle-start, 
then  immediately  returned  to  False.  If  instead  it  is  kept  True,  all  chips  enter  Program  Mode. 
While  in  Program  Mode  they  will  use  the  same  wires  in  much  the  same  way.  but  treat  the 
words  as  program  data  rather  than  computational  data.  In  Program  Mode,  Global-Reset 
is  (tor  example)  brought  False  for  exactly  one  clock-tick  to  mark  program  cycle-start,  then 
immediately  returned  to  True. 

The  system  must  be  programmable  from  any  conceivable  system  state,  in  particular  from 
random  or  partially  programmed  states.  Thus  the  delay  registers  must  be  assumed  invalid. 
All  transfers  occur  immediately  following  program-cycle-start.  The  host  writes  successive 
program  data  words  to  the  input  ports  and  these  are  clocked  through  the  network.  Each 
module  will  clock  in  program  data  as  operands,  operate  on  it  as  discussed  below,  and  clock 
out  program  data  as  result.  When  the  module  clocks  in  the  program  data  containing  its 
own  module  program,  that  data  will  be  loaded  into  its  control  register(s)  as  below.  After 
all  VAMPs  have  been  programmed  in  this  fashion,  Global-Reset  can  be  brought  and  kept 
low  forcing  Data  Mode.  Thereafter  normal  data  computation  can  occur  through  the  fully 
programmed  network. 
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Thus  we  must  insure  that  every  VAMP  is  directly  or  indirectly  accessible  to  the  host, 
and  that  each  VAMP  can  recognize  its  own  program  when  seen.  The  first  condition  is 
satisfied  by  the  connected  directed  acyclic  graph  nature  of  the  network;  any  module  not  at 
least  indirectly  accessible  to  the  host  could  never  receive  operands  and  hence  could  not 
participate  in  the  computation. 

One  way  to  allow  each  module  to  recognize  its  own  program  is  to  assign  to  each  one 
a  unique  number,  “burned  in”  as  in  Programmable-Read  Only-Memory  (PROM)  before  the 
chip  is  ever  inserted  into  the  network  (a  prototype  might  simply  have  jumpers  wired  to  pins). 
The  host  then  simply  writes  pairs  of  (VAMP-number,  VAMP-program)  to  the  network  inputs, 
and  these  are  clocked  through  the  network.  When  a  module  recognizes  its  own  number 
in  Operand-A  (i.e.  the  first  operand),  it  loads  the  associated  program  For  example,  the 
VAMP  number  might  be  in  the  first  scalar  component  of  the  vector,  the  VAMP-program 
contained  in  the  remaining  two  scalars.  Variations  on  this  scheme  could  also  allow  RAM  to 
be  loaded,  perhaps  using  the  second  scalar  component  as  an  address  and  the  third  as  the 
microcode  word  to  store  there.  The  VAMP  outputs  Operand  A  (unchanged)  as  its  Result. 
The  network  can  be  programmed  by  writing,  to  each  input  port,  one  such  number  program 
pair  for  each  module  in  the  network.  Chips  are  interchangeable  simply  by  changing  the 
VAMP  numbers  transmitted  by  the  host. 

We  conclude  this  section  by  very  briefly  sketching  extensions  permitting  a  limited 
dynamically  programmed  communication  system,  and  a  limited  error  correcting  capability. 
Neither  of  these  are  required  for  the  Inverse  Dynamics  algorithm,  but  would  render  the 
device  more  useful. 

The  communication  requirements  of  parallel  algorithms  are  often  mostly  local,  with  a 
few  long-distance  data  transfers  which  must  also  be  supported.  This  is  the  case  in  the 
Inverse  Dynamics  as  well.  The  Mostly  Local  Bus  (MLB)  in  Figure  VII.4  is  intended  to 
support  both  high  local  band-width  and  sparse  but  necessary  long-distance  communication. 
It  can  be  used  by  the  host  to  gain  programmatic  control  of  the  communication  network 
implemented  by  the  array  of  processors.  Programmatic  host  control  of  the  implemented 
communication  structure  (the  data  dependency  graph),  together  with  host  control  of  the 
operation  performed  at  each  node,  allows  the  same  physical  device  to  efficiently  implement 


many  different  algorithms  without  physical  reconfiguration  or  rewiring. 

Refer  to  Figure  VII. 4  (MLB).  It  depicts  a  multi  tiered  bus,  with  some  of  the  tiers  composed 
of  many  short  busses,  some  tiers  composed  of  several  medium-length  busses,  and  some  tiers 
composed  of  a  very  few  long  busses.  Conceptually,  imagine  that  each  module  is  potentially 
connected  to  each  bus  directly  in  front  of  and  back  of  it.  (In  any  real  implementation,  small 
groups  of  modules  would  actually  share  long-distance  drivers  to  keep  the  area  penalty  low 
for  sparse  long-distance  traffic.)  A  communication  network  structure  can  be  imposed  by 
specifying  when  each  module  reads  or  writes  which  bus.  By  insuring  that  the  destination 
module  reads  the  same  bus  of  the  same  tier  at  the  same  time  the  source  module  writes  to 
that  bus,  any  two  modules  which  connect  to  the  same  MLB  can  communicate.  Different 
source/destination  pairs  can  share  the  same  bus  provided  the  communications  occur  at 
different  times  in  the  basic  time  cycle  of  the  algorithm.  Within  the  framework  we  have 
developed  this  means  adjoining,  to  the  registers  governing  at  what  time  busses  are  read 
and  written,  additional  registers  governing  which  tier  is  targetted  These  registers  would  be 
loaded  exactly  as  already  described. 

In  those  algorithms  susceptible  to  this  architecture,  most  communciation  will  be  local 
and  can  occur  on  the  many  short- interval  busi  es,  achieving  high  local  bandwidth.  The  few 
long  distance  lines  exist  to  support  sparse  necessary  long-distance  communication,  but  if 
used  too  heavily  the  system  performance  degrades.  This  degradation  can  be  made  graceful, 
however,  by  inserting  extra  length  into  the  basic  cycle.  This  creates  extra  slots  into  which 
conflicting  communications  could  be  transferred.  Note  that  this  communciation  structure  is 
most  suitable  for  algorithms  having  a  straight-line  systolic  nature,  such  as  characterizes  the 
Inverse  Dynamics.  In  cases  where  it  is  applicable,  it  permits  relay-free  communication  with 
limited  pin-out  or  bus  connections,  and  would  be  most  suited  to  WSI  or  systems  with  large 
numbers  of  processors.  The  relay-free  character  of  the  communication  differs  from  schemes 
based  on  twisted  n-cubes,  which  may  require  up  to  0()ogn)  relay  delays  between  any  two 
processors. 

It  would  be  possible  to  "tune"  the  MLB  to  the  communication  requirements  of  a  particular 
algorithm  (while  preserving  the  capability  to  dynamically  reconfigure  so  as  to  perform  others) 
by  adjusting  the  ratios  of  bus  lengths  in  successive  tiers.  For  illustrative  purposes,  Figure 


VII.4  shows  the  number  of  processors  served  by  each  bus  doubting  at  every  second  tier,  with 
the  even  and  odd  tiers  offset  from  each  other  so  that  each  processor  is  roughly  symmetric 
in  communicative  power  each  direction.  The  number  of  tiers  thus  grows  as  O(log^)  of  the 
number  of  processors.  To  tune  the  MLB  to  a  particular  algorithm,  one  would  require  that 
the  number  of  busses  of  a  given  length  was  roughly  proportional  to  the  fraction  of  messages 
communicated  a  distance  of  approximately  that  length.  This  rough  guide  must  be  refined 
further  if  the  average  message  length  varies  with  distance. 

Finally,  we  note  that  most  of  the  machinery  necessary  to  support  a  limited  error  correcting 
capability  has  been  developed.  This  will  permit  several  cases  of  gross  individual  chip  failures 
to  be  caught  and  flagged,  with  the  device  automatically  resuming  correct  computation 
following  error  correction.  There  are  hardware  error  classes  for  which  this  capability  will 
not  apply,  of  course,  some  examples  being  a  direct  short  between  power  and  ground  (it  is 
difficult  to  automatically  recover  from  this  in  any  case),  occassional  intermittent  faults,  or 
failures  in  the  error-correcting  circuitry  itself.  Since  the  error- correcting  circuitry  is  a  small 
fraction  of  the  total  chip  circuitry,  however,  the  probability  that  it  will  fail  first  is  likely  also 
to  be  small. 

Order  the  VAMPs,  so  that  each  has  a  unique  successor  and  a  unique  predecessor 
except  the  two  end  VAMPs  Basically  we  suggest  a  mechanism  in  which  each  VAMP  checks 
its  successor  after  having  been  checked  by  its  predecessor.  It  thereafter  plays  the  role  of 
its  successor  in  the  computation  while  its  successor  checks  the  next  VAMP,  and  so  forth. 
After  each  VAMP  has  been  checked  they  return  to  a  normal  configuration,  and  repeat  the 
process.  If  a  faulty  computation  should  be  detected  in  the  checking  process,  the  faulty 
VAMP  will  be  inhibited  by  its  predecessor  and  excised  from  the  computational  structure  of 
the  device  Its  predecessor  will  continue  to  play  its  role  in  the  computation  until  the  faulty 
component  can  be  replaced. 

In  addition  to  those  discussed  above,  each  VAMP  would  also  have  a  second  set  of 
inputs  Operand-A',  Operand-B',  inputs  for  Test-in,  Test-back-in  and  Inhibit-in,  outputs 
for  Test-out,  Test-back-out  and  Inhibit-out,  and  input/output  Result'.  Each  VAMP  s 
Operand-A'  and  Operand-B'  inputs  are  connected  to  the  Operand-A  and  Operand-B 
inputs,  and  its  Result'  input/output  connected  to  the  Result  output,  of  its  auccesfor.  Its  Test- 
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in  and  lnhibit-in  inputs  are  connected  to  the  Test-out  and  Inhibit-out  of  its  predecessor.  Its 
Test-back-in  input  is  connected  to  the  Test-back-out  output  of  its  successor.  See  Figure 
VII. 5. 

These  new  inputs  and  outputs  are  used  as  follows.  During  normal  operation  of  a 
VAMP,  Test-in,  Test-back-in,  lnhibit-in,  Test-out,  Test-back-out  and  Inhibit-out  will 
all  be  False.  The  VAMP  will  load  as  its  operands  the  inputs  Operand-A  and  Operand-B, 
and  dump  its  result  on  the  output  Result.  The  input/output  Result'  will  be  tri-stated. 
Eventually,  the  VAMP's  predecessor  will  bring  Test-in  to  True  at  the  start  of  a  cycle.  The 
predecessor  is  now  filling  the  VAMP's  computational  role,  and  the  VAMP  is  free  to  check  its 
successor.  It  does  this  by  loading  its  operands  for  that  cycle  from  the  inputs  Operand-A' 
and  Operand-B',  performing  the  same  calculation  as  its  successor,  and  comparing  its  result 
to  its  successor's  output,  which  is  loaded  through  the  VAMP's  input/output  Result'.  (To 
do  this,  of  course,  it  must  have  a  second  set  of  control  registers,  correctly  loaded  to  match 
those  of  its  successor.)  The  output  Result  is  tri  stated. 

If  the  two  results  do  not  match,  then  the  successor  is  assumed  to  be  faulty  (recall  that 
the  VAMP  itself  has  just  been  checked  in  the  immediately  preceding  cycle).  The  VAMP 
excises  its  successor  from  the  computation  by  bringing  Inhibit-out  to  True  This  causes 
its  successor  to  tri-state  its  data  outputs  and  bring  its  conti  ol  outputs  False,  which  is 
implemented  by  very  simple  gate  logic  at  each  pad  driven  directly  from  lnhibit-in.  The 
VAMP  will  continue  to  fill  its  successor's  role  in  the  computation  until  the  device  is  brought 
down  for  maintenance.  At  that  time  it  can  broadcast  its  VAMP-number  through  the  net  in 
Maintenance  Mode  as  a  fault  finder,  somewhat  akin  to  the  way  programs  were  broadcast 
in  Program  Mode,  except  that  instead  of  outputting  Operand  A  unchanged  each  VAMP 
outputs  any  Operand  which  contains  a  fault  notification.  The  VAMP-numbers  corresponding 
to  fault  finders  will  therefore  emerge  from  the  net  where  the  motor  torques  emerge  in  Data 
Mode,  and  the  faulty  VAMPs  can  be  replaced. 

If  the  two  results  match,  then  the  successor  is  assumed  to  be  not  faulty.  The  VAMP  will 
bring  Test-out  to  True  at  the  start  of  the  next  cycle,  causing  its  successor  to  repeat  the 
operation  just  described.  Since  the  VAMP  must  now  fill  its  successor's  role,  it  will  continue 
to  take  as  its  operands  the  inputs  Operand-A'  and  Operand-B'.  but  in  addition  will  now 
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dump  its  resutt  on  the  input/output  Result'.  The  output  Result  remains  tri  stated. 

The  VAMP  will  return  to  normal  operation  after  its  input  Test-back-in  (connected  to 
its  successor's  output  Test-back-out)  is  brought  True  for  one  cycle  The  input/output 
Result'  is  tri  stated,  and  its  output  Test- back-out  is  brought  True  for  one  cycle  (causing 
its  predecessor  to  resume  normal  operation  in  the  same  sequence).  On  the  next  cycle  the 
VAMP  will  again  load  as  operands  the  inputs  Operand-A  and  Operand-B,  and  dump  its 
result  on  the  output  Result. 

So  far  we  have  described  the  action  of  VAMPs  in  the  middle  of  the  order.  It  remains 
to  describe  the  ends.  The  last  VAMP  can  simply  have  its  output  Test-out  fed  back  into  its 
input  1  est-back-in.  This  causes  the  sequence  to  reverse  when  the  last  VAMP  is  reached. 
To  check  the  first  VAMP  we  must  introduce  an  extraneous  VAMP  (and  properly  speaking,  a 
second  extraneous  VAMP  to  check  that  one).  The  first  extraneous  VAMP  can  simply  have 
its  output  Test- back-out  fed  back  into  its  input  Test-in. 


Table  VII. 1  —  Primitive  Module  Operation  Sequencing 
(refer  to  Figure  VII.  1) 

(Tina  primitive  module  calculates  the  3—  scalar  component  of  the  result  vector) 


Operation  Flops 

Dest. 

Source 

VA:  1 

BUS-1 

REG:  OP-A-3 

BUS-2 

<= 

REG:  OP-B-3 

ADDER 

*= 

BUS-1.  BUS-2 

(result  tn  adder/subtractor) 

SV:  1 

BUS-1 

REG:  OP-A-1 

BUS-2 

REG:  OP  B-3 

MULT-1 

<= 

BUS-1,  BUS  2 

(scalar  multiplier  in  1—  scalar  component  of  Operand  A,  by  convention ) 

(result  in  multiplier- 1) 

VD,  MV,  MM:  1 

BUS-1 

REG:  OP-A  3 

BUS- 2 

REG:  OP-B-3 

MULT-1 

BUS  1.  BUS-2 

MULT-2 

<= 

REGS:  OP  A  2,  OP  B-2 

2 

BUS-1 

<= 

MULT-1 

BUS-2 

<= 

MULT  2 

ADDER 

i= 

BUS-1,  BUS-2 

MULT-1 

t- 

REGS'  CD  A  1  OP  B  1 

3 

BUS-1 

<= 

MULT-1 

BUS-2 

ADDER 

ADDER 

BUS-1,  BUS-2 

(alternatively,  adder  can  accumulate  instead  of  being  totally  bus-driven) 

(result  in  multiplier-1) 

VC:  1 

BUS-1 

<= 

REG:  OP-B-2 

BUS-2 

REG:  OP-B  1 

MULT-1 

<= 

BUS-1,  REG:  OP-A-1 

MULT-2 

<= 

BUS  2,  REG:  OP-A-2 

2 

BUS-1 

<= 

MULT-1 

BUS-2 

MULT-2 

SUBTRACTOR 

*= 

BUS-1,  BUS-2 

(subtractor  computes  BUS-1  —  BUS-B) 
(result  in  adder /subtractor) 
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CONTROL  STRUCTURE 

IN  OPERATION,  CONTROL  NUST  GOVERN: 

-  1)  When  operand  R  is  read  in,  &  to  which  regs; 

2)  When  operand  B  is  read  in,  &  to  which  regs; 

3)  When  the  Result  is  output,  &  from  which  reas; 

4)  What  operation  is  performed. 
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Figure  VII. 3  —  VAMP  Control  Registers 
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Figure  VII. 5a  —  Non-Error-Correcting  Configuration 
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Figure  VII. 5b  —  Limited  Automatic  Error-Correction 


8.  SUGGESTIONS  FOR  FUTURE  EXTENSIONS 


We  very  briefly  allude  to,  without  discussing  in  depth,  a  few  possible  extensions  to  this  work. 
If  the  computation  can  be  made  to  run  at  a  rate  which  is  only  I/O  bound,  it  becomes  feasible 
to  consider  “active  memory”  (or  an  I/O  device)  which  provides  the  calculated  torques  as 
soon  as  the  desired  motion  has  been  loaded  in  appropriate  locations.  Thus  it  may  be  possible 
to  build  an  on-line  "optimizing  trajectory  compiler"  in  which  the  desired  motion  (trajectory) 
for  the  next  several  time  periods  is  pre-planned,  the  motor  torques  automatically  generated, 
and  the  time  sequence  of  necessary  motor  torques  inspected  slightly  btjore  the  manipulator 
has  actually  arrived  at  the  trajectory  points.  If  the  motor  torques  are  excessive  (motor  or  joint 
damage)  or  below  the  rated  maximum  (faster  motion  possible),  the  proposed  trajectory  could 
be  modified  accordingly,  on  line.  Also,  Torre  and  Poggio  have  a  result  indicating  that  neural 
structures  could  perform  an  arithmetic  multiplication  in  about  a  millisecond.  While  certainly 
not  arguing  that  in  fact  it  is  don«.  that  way  in  the  brain,  observe  that  in  principle  it  would 
be  possible  to  compute  the  Inverse  Dynamics  in  approximately  real  time  using  a  suitable 
neural  structure  We  close  with  a  few  remarks  concerning  the  possibility  of  generalizing  the 
_'(log(n))  embedding  to  other  recursive  structures 

The  question  of  how  to  incorporate  dynamical  considerations  into  on-line  trajectory 
planning  is  an  area  ot  open  research  (7J.  Hollerbach(15J  shows  how  to  uniformly  scale  the 
velocity  of  a  trajectory  so  as  to  remain  within  torque  limits.  This  applies  only  to  uniform 
velocity  scaling,  however,  and  we  might  like  to  be  able  to  change  the  path  of  the  trajectory 
or  scale  velocity  in  a  non-uniform  way.  Bobrow,  Dubowsky  and  Gibson[6]  solve  the  general 
case  of  time-optimal  control  along  a  specified  trajectory,  but  do  not  allow  the  path  to  be 
varied  in  space  and  require  moderately  intensive  computation  by  the  host. 

Consider  Figure  VIII. 1,  which  might  be  taken  to  represent  a  shift  register  of  depth  m. 
Data  records  pushed  in  at  the  top  progresses  through  the  shift  register  with  timestep  X  to 
emerge  out  the  bottom  time  m\  later. 

Imagine  that  the  shift  register  was  interfaced  to  a  memory  board  in  such  a  way  as  to 
‘‘look  like"  memory  to  the  host,  so  that  any  of  the  locations  could  be  read  or  written  as 
memory.  At  each  timestep  X  the  0—  row  would  be  shifted  out  the  bottom,  the  whole  array 
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shifted  down  one,  and  the  host  would  write  a  new  set  of  values  at  the  top.  At  any  time,  of 
course,  the  host  could  read  or  write  any  of  the  array  locations. 

Now  imagine  that  some  of  the  array  locations  in  the  j—  row  correspond  to  “input 
values"  ({9(jX),9(jX),9(;'X)})  and  some  locations  correspond  to  "motor  torques”  ({r(jX)}); 
and  that  the  "torque"  locations  are  really  hard  wired  (through  such  a  dynamics  box  as  we 
have  described)  to  the  "input  value"  locations.  The  box  continuously  computes,  for  each 
set  of  input  values  in  the  shift  register,  the  corresponding  torques  —  if  the  host  changes 
an  input  value  anywhere  in  the  array,  the  torques  corresponding  to  the  new  set  of  values 
automagically  appear.  Set  X  equal  to  the  refresh  rate  at  which  new  motor  torques  are 
supplied  to  the  manipulator.  Now  as  each  value  is  shifted  out  the  bottom,  imagine  that 
a  demon  catches  it  and  passes  the  torques  to  the  manipulator  motors  —  simultaneously 
another  demon  fills  in  a  new  set  of  input  values  at  the  top  and  the  torques  appear.  We  might 
as  well  have  the  inertial  Cartesian  coordinates  and  velocities  of  each  link  endpoint  appear 
too,  since  they  are  manifestly  simpler  to  calculate  from  the  same  input  data  as  the  inverse 
dynamics,  the  fractional  cost  to  include  them  is  small. 

This  now  becomes  a  fairly  explicit  representation  of  many  interesting  characteristics  of 
the  manipulator  trajectory  for  the  next  m  time  periods  If  the  host  inspects  the  values  but 
makes  no  changes,  those  m  values  will  be  shifted  to  the  manipulator,  one  by  one;  and  that 
will  define  the  path  the  manipulator  will  follow  for  the  next  m  periods.  Alternatively  the 
host  may  change  one  or  more  values  causing  the  corresponding  changes  in  the  torques; 
the  manipulator  would  then  follow  the  revised  course.  This  arrangement  might  be  useful 
in  helping  to  solve  the  problem  of  how  to  incorporate  dynamical  considerations  into  on-line 
trajectory  planning,  acting  to  help  optimize  a  crude  trajectory  generated  by  a  higher-level 
planner. 

One  extension  to  this  basic  idea  would  be  to  use  our  box  to  also  calculate  the  input 
value  joint  velocities  and  positions  ({?(£),  ?(<)})  directly  from  the  acceleration  profile  ({?(£)}), 
rather  than  having  them  set  directly  by  the  host.  This  would  avoid  the  embarassing  possibility 
of  (e  g.)  a  trajectory  requiring  an  instantaneous  step  discontinuity  in  manipulator  position, 
as  well  as  reduce  computational  demands  on  the  host.  Another  extension  would  be  to  have 
several  such  shift  registers;  one  could  thereby  sweep  out  an  envelope  around  the  proposed 


trajectory,  exploring  in  parallel  possible  futures  and  interpolating  between  them. 

Another  interesting  area  is  the  overlap  with  human  psychology  and  neurophysiology. 
Torre  and  Poggio[4l],[22]  have  shown  the  theoretical  possibility  that  a  neuron  could  perform 
an  analog  multiplication  in  its  dendritic  branches  within  about  a  millisecond  (this  capability 
was  originally  postulated  to  be  necessary  in  order  to  explain  certain  aspects  of  visual 
processing).  The  analysis  treats  the  dendritic  branches  according  to  membrane  theory  in 
passive  RC  cables.  Where  g<  and  g  .  are  inputs  they  are  able  to  produce  a  term  proportional 
to  (gt  —  ag\g<).  By  additionally  connecting  the  input  appropriately  to  a  side  branch  on 
the  dendritic  pathway  to  the  axon  it  is  possible  to  show  theoretical  cancellation  of  the  linear 
term.  Analog  additions  may  be  performed  as  in  classical  circuit  theory  (given  appropriate 
arrangements  of  the  dendrites). 

Thus,  given  the  time  bounds  on  the  formulations  developed  above,  the  nervous  system 
might  be  capable  of  performing  the  inverse  dynamics  calculation  in  something  approximating 
real  time.  Since  the  computations  are  performed  in  analog  by  biological  components  one 
necessarily  expects  them  to  be  "dirty  ',  i.e.  contaminated  with  noise,  inaccuracies,  and 
other  errors.  If  one  postulates  a  large  number  of  such  inaccurate  devices  performing  the 
same  calculation  and  averaging  the  result,  however,  from  fundamental  statistical  properties 
one  may  show  that  the  resulting  calculation  can  be  made  arbitrarily  accurate  by  taking  the 
number  ot  devices  arbitrarily  large.  Also,  though  the  formalisms  were  developed  for  a  single 
chain  of  length  n,  the  fact  that  the  time  complexity  increases  only  as  0(log(n))  suggests  the 
possibility  that  one  might  control  other  large  systems  involving  many  degrees  of  freedom 
without  paying  an  exorbitant  penalty  in  real  elapsed  time.  Hollerbach  and  Flash[17]  have 
performed  experiments  investigating  the  possibility  that  human  subjects  perform  some  sort  of 
dynamic  scaling  in  planning  planar  arm  trajectories.  Many  associated  questions  immediately 
arise,  of  course,  such  as  how  the  nervous  system  learns  to  perform  the  computation;  or, 
if  it  is  hard  wired,  how  the  system  learns  the  parameters;  and  so  forth.  We  do  not  wish  to 
engage  in  this  debate,  but  only  to  point  out  that  the  computational  aspects  are  tractable. 

Finally,  it  seems  likely  that  the  process  of  embedding  a  serial  linear-time  recursive 
algorithm  in  a  parallel  logarithmic-time  algorithm  is  generalizable,  certainly  at  least  within  the 
context  of  associative  ring  operators.  We  saw  that  several  basic  properties  were  exploited 
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in  our  analysis  above,  including  the  associativity  of  the  ring  operators  and  the  capability  to 
order  the  recursive  variables  in  time  according  to  data  dependencies. 

The  basic  strategy  followed  in  Section  IV  was  to  expand  the  closed-form  non-recursive 
formula  for  into  an  equivalent  expression  involving  two  similar  formulae  for  and 
Ar*+1,().  This  was  taken  to  be  the  combining  form  for  In  making  the  expansion,  it  was 
also  necessary  to  expand  and  re  group  expressions  for  the  dependent  variables  in  terms  of 
their  combining  forms.  We  speculate  that  the  associativity  of  the  operators  permitted  the 
necessary  expansion  of  the  dependent  variables  Linearity  is  not  necessary,  as  it  is  possible 
to  devise  a  combining  form  for  the  (non-linear)  from  expressions  for  (^)  and  (^): 

(  i  y.  mb) 

\A  +  B  )  (-\)  f  (^j) 

-GWi> 

Thus  it  might  be  possible  to  devise  togarithmic-time  parallel  algorithms  from  linear-time  serial 
ones  even  if  scalar  division  is  involved,  and  perhaps  other  non-linear  (non-ring)  operators. 
Of  particular  interest  would  be  a  general  mechanism  for  logarithmic  embedding  of  linear 
algorithms  involving  matrix  inversion  (or  solutions  of  simultaneous  linear  equations,  e.g. 
Gaussian  elimination)  at  each  step.  This  would  render  a  much  wider  class  of  algorithms 
accessible  to  logarithmic-time  techniques,  e  g.  perhaps  the  inverse  kinematics  or  the  direct 
(integral)  dynamics. 

In  Appendix  A  it  is  noted  that  the  ability  to  order  the  recursive  variables  according  to 
data  dependencies 

X,  >Y,  >  Z,t  ... 

where  we  define 

■X",  >  Y,  iff  y,  does  not  depend  on  X,  for  any  j 
iff  dYJdX.  s  0 

guaranteed  minimal  satisfiability  of  the  relative  offset  inequalities.  We  speculate  that  this 
condition  might  also  be  necessary  to  form  logarithmic-time  combining  forms  as  weM.  This 
may  arise  since  in  devising  the  combining  forms  it  was  necessary  to  apply  previously  deduced 
combining  forms  to  break  apart  the  constituent  dependent  variables.  Lacking  this  condition, 


one  may  at  least  imagine  a  case  in  which  deriving  the  combining  form  for  X  ,,i  requires 
expanding  the  combining  form  for  while  deriving  the  combining  form  for  requires 
doing  the  same  for  X„.:.  The  ability  to  order  the  recursive  variables  by  data  dependencies 
insures  that  at  least  this  particular  deadlock  cannot  arise.  A  more  formal  demonstration  of 
the  applicability  of  this  condition  would  be  useful. 
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9  CONCLUSIONS 


We  have  shown  that  considerable  time  savings  accrues  from  embedding  the  inverse  dynamics 
calculation  in  a  parallel  computation.  A  parallel  time  algorithm  with  time  complexity  only 
logarithmic  in  the  number  of  joints  has  been  derived.  Hardware  necessary  to  implement 
such  parallel  algorithms  has  been  considered,  and  the  requirements  shown  to  be  substantial 
but  not  excessive.  Using  the  concepts  developed,  it  should  be  possible  to  devise  a 
device  capable  of  implementing  the  calculation  at  a  speed  primarily  bounded  by  the 
input/output  requirements  which  the  algorithm  imposes  on  the  host.  We  have  sketched 
speculative  extensions  to  this  work  in  the  areas  of  on-line  trajectory  planning,  psychology 
and  neurophysiology,  and  parallel  algorithm  theory. 
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Appendix  A  —  Derivation  of  the  Linear  Time  Offsets  and  C 

In  the  following,  as  in  Table  III.  1 ,  “Avail(X,  _i)  =  f"  means  that  variable  X’.-i  is  made 
available  to  the  «—  node  at  time  t  (on  the  backward  recursion;  substitute  X,+i  on  the 
forward  recursion).  The  abbreviations  V A  (vector  addition),  VC  (vector  cross  product),  etc., 
are  explained  in  Table  HI.1;  they  denote  the  time  required  to  perform  certain  matrix  or  vector 
operations. 

First  we  determine  the  relative  delays,  or  offsets,  in  variable  availability  times.  We  do 
this  by  requiring  the  following  implication,  for  each  propagated  recursive  variable: 

Avail(X  )  --  niax(y4tmi7(A',_i)  -f  a,  Avatl(Y,—i )  +  0,  Avatl(Z,—  \ )  -f  $,...)  +  7 
=>  Avail(X,)  =  Avatl(X,  —  i)  -f  a  -(-  7 

This  condition  amounts  to  saying  that  nothing  delays  the  availability  of  a  variable  longer  than 
itself.  This  allows  one  to  infer 

max(Atiaxf(A’1_1)  -j-  a,  Avail(Y,—i)  +  0,Ava\l(Z,-.\)  -f  6, . . .)  =  j4vflii(X,_i) 

from  which  immediately  follow  the  inequalities 

Avail(X,  —  i)  +  a  >  Avail(Y, _  1 )  +  0 
Avatl(X,—\)  4-  a  >  Avatl(Z,—i)  -f  6 


That  the  set  of  all  such  inferrable  inequalities  is  globally  satisfiable  follows  from  the  ability 
to  globally  order  the  recursion  variables 

X,  t  Y,  >  Z,  > 


where  we  define 

X,  >  Y,  iff  Y,  does  not  depend  on  X,  for  any  j 
iff  dYJdX,  ==  0. 

In  the  linear  Newton  Euler  recursion,  for  example,  we  have 

n,  >  f,  V  p,  >  w,  y  w, 

and  the  non-propagated  variables  could  be  included  in  the  chain  if  desired. 


By  considering  starting  conditions  (initiation  of  the  calculation)  one  can  generate  another 
set  of  inequalities  of  the  form 


i4i/a»7(Xi)  +  o'  >  AvatlfY,)  +  0' 
Avatl(Xi)  a'  >  Avatl(Zi)  +  S' 


Since  both  sets  of  inequalities  are  satisfiable  it  is  possible  to  reach  a  point  of  minimum 
satisfaction  (how  dreary.  .  why  would  one  wish  to  do  so?).  This  is  the  unique  point  (Avatl(X, ), 
,4vo»/(V|),  Avail(Z\), .  .)  which  satisfies  both  sets  of  inequalities  above,  and  also  minimizes 
Avatl(X i)  for  each  variable  X.  This  defines  the  relative  offsets  which  will  minimize  the  total 
computational  time. 

On  the  backward  Newton-Euler  recursion  only  w,,  w,,  and  p.  need  be  considered.  This 
is  because  r,,  F,,  and  A’  are  merely  passed  directly  to  forward  recursion  nodes.  From  Table 
111.1, 

Avail(uit)  —  max(Avotf(w,_i)  +  VC  +  V A,  Ava*7(w,_i))  +  MV  -f-  V A 
Thus  for  minimum  delay  (MV  +  VA)  in  propagating  u,  we  require  that 

max{Avo«7(w,_i)  -f  VC  +  V  A,  Atia«7(w,_t ))  =  At>at7(cv,_i), 


hence 

Avot7(w,_, )  >  Avo»7(u>,_, )  +  VC  +  VA  (*) 


Similarly, 

A\iatl(p,)  =  max(Auai7(u,)  +  2  VC  +  VA, 

Avail(u),)  +  VC  +  VA, 

Avai7(p,_,)  +  MV)  4-  VA 
=  max(Ata«7(w— ,)  +  MV  -f  2 VC  +  2VA, 

(max(Ava»/(w,_, ),  Avo»7(wt_, )  -f  VC  +  VA)  +  MV  +  VA)  +  VC  +  VA, 
Avo»i(pt_,)-|-  Af  V)  -f  VA 
=  max(Avaii(w,_|) -F  MV  -f  2VC  -+■  3VA, 

Ava«7(w,_|)  f  MV  +  VC  +  2VA, 

Avo»t(p,_j)  +  MV)  •+■  VA 
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And  so  the  minimal  delay  in  propagating  p,  requires  both 

Atmtl(p,_,)  >  Avoi/(w,_i)  +  2 VC  +  3VA 

Avatl(p,_,)  >  Avail(uj,._,)  -f  VC  +  2 VA.  1 

If  equality  holds  in  (*)  then  these  two  bounds  are  equivalent. 

On  the  forward  recursion  only  /,  and  n,  need  be  considered. 

Ava«/(n,)  =  max(Ava«/(/,  +  i)  +  VC  -f-  VA,  At/a«7(n,_H ))  +  MV  -F  VA 

and  so 

Avatl(n,.+  I)  >  Avail(f,  +  i )  -f  VC  A  VA.  (***) 

Next  we  determine  the  constant  <T  by  showing  when  r, .  the  last  generalized  joint  force 
of  the  forward  recursion,  becomes  available  as  output.  From  Table  (It.  1 ,  assuming  all  input 
values  become  available  simultaneously  at  time  t  ---  0, 

Avail(ui )  =  MV  +  VA 
Avail(  ij,)  —  MV  +  VC  +  2  VA 

Avai/(jj,)  —  max(Avaif(cj1)  +  VC  +  V A,  Avail(u)t )  -f  2 V C  VA,  MV)  +  V A) 

=  MV  +  2VC  +  4VA 

Note  that  these  satisfy  (*)  and  (**)  so  the  propagation  time  is  (MV  -f  VA)  per  node. 
Avaxi(u„)  =  (n  —  \)(MV  -f-  VA)  -f  Avait( W|) 

-=  (n  -  1)(MV  f  VA)  -f  MV  -f  VA 

Aua«7(w„)  =  (n  -  1)(.WV  +  VA)  +  MV  +  VC  +  2 VA 

Avatl(pJ  =  [n  —  1)(M  V  +  V  A)  ~F  Af  F  4-  2  VC  4VA 

Avail(r„)  —  rnax(Avai7(pM),  Avai7(w„)  -F  VC  -F  VA,  Avatl(u/„)  -F  2VC  -f-  VA)  -f  V A 
=  (n  -  1){MV  +  VA)  +  MV  +  2 VC  +  5VA 
At>at7(Fn)  =  At>a«7(r„)  SV 

=  (n  -  1)(MV  -F  VA)  +  SV  M V  +  2VC  +  5VA 

Avail(N„)  =  max(Auai7(wn)  +  VC,  Avat7(w„))  -(-  MV  -F  VA 
=  (n  -  1)(MV  +  VA)  +  2M  V  +  VC  +  3VA. 

Thereafter,  on  the  forward  recursion,  (recognizing  that  Ava«7(/n+1)  =  Ava«f(nn+1)  =  0), 

Avatl(f„)  =  Avail(Fn)  -F  VA 

=  (n  -  1)(MV  +  VA)  +  SV  +  MV  +  2 VC  +  6VA 
Avoif(n„)  =  rnax(Avail(F„)  -F  VC,  Ava»7(Af„))  +  3VA 
=  (n  -  1)(MV  -F  VA)  +  MV  +  VC  +  6 VA 
-F  max(SV  +  2 VC  +  2VA,  MV) 

=  (n  -  l)(AfV  -F  VA)  +  SV  +  MV  +  3VC  +  8VA 
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These  two  expressions  satisfy  (***)  so  propagation  occurs  at  the  maximum  rate  of  (MV  +  V  A) 
per  node. 

Avail(n\)  =  2 (n  -  1){MV  +  VA)  +  SV  +  MV  +  3 VC  +  8 VA 
AvcliI(t  .)<  max(i4vaif(/ 1 ),  vtvajf(ni )) 

=  2{n  -  1 )( A/ V  +  VA)  +  SV  +  MV  -+-  3VC  +  8 VA 

(Actually,  Avatar,)  will  depend  on  whether  joint  t  is  translational  (=  At>at/(/,))  or  rotational 
(^-  Avail(nt)),  but  we  assume  here  rotational,  the  worst  case.) 

Assuming  a  maximally  parallel  implementation,  we  would  have: 


VA=  1 

Addn 

(using  3  adders) 

SV  —  1 

M  ult 

(using  3  multipliers) 

VC=  1 

Mult  4  1  Addn 

(using  6  multipliers  and  3  adders) 

MV  —  1 

Mult  +  2  Addns 

(using  9  multipliers  and  3  adders). 

So 


Avai((ri)  <  (2n  +  3)  Mults  -f-  (6n  ■+  7)  Addns 


Appendix  B  —  Derivation  of  the  Logarithmic  Recursive  Formulae 

NEWTON-EULER  BACKWARD  RECURSION  VARIABLES: 

The  derivation  of  the  logarithmic  combining  form  for  w,  has  been  developed  in  the  text. 
Next,  we  show  that  w,  satisfies  the  following  closed-form  formula: 

i 

o 

as  it  is  a  fixed  point  of  the  recursive  formula  for  w,  in  Table  1.1: 

=  A’[  5Z  WJ,,-l°AgJ~l'9}  +  “^-1  X 
Vj— 0 

+w,_i  X  *,-19,)^ 

=  +  W,_,  X  Z,-iq,)j 

As  in  the  case  of  w,,  we  take  w0  =  i  )90 .  Most  applications  of  interest  will  have 
=  9o  =  °- 

in  oraer  to  matcn  at  a  =  o  and  6  = », 

6 

w«,k  =  22  x 

j—a 

k 

=  22  (WJ  kW<i'+>).i>)Taj(*J-''ij  +  w».U-«)  X 

b 

+  22  +  wi*+i).u-i))  x 

6 

=  "+■  '*'(*+•  ),l>  +  —  I)w”.*)  X  *)  — 1$;) 

—  +  (Wj*+i  ),*«■,»)  X  W(i  +  1),6  -f  W{k  +  l),b 

which  is  the  combining  form  for  wa,». 
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To  derive  the  combining  form  for  pah,  it  is  necessary  to  create  the  auxiliary  variables 


Qu.h  =  y 

=  Wj+l.tQa.t  +  Q(*+l).k 

s  y 

j~*n 

=  ^1+ l.fcfti.t  +  R(t+,).k 

b 

s„.,.  =  Y  W]+\A w-.j  x  pJ) 

;=«n 

5Z  W'  +  X'i.((Wl+U}w.,.i'  -f  w*  +  t.j)  X  p;^ 
j=*+i  v  7 

=  +  (^Z+i„>,w<i.r-)  X  1 ).h  -1- 

Next  we  show  that  p,  satisfies  the  closed-form  expression 

P.  =  W^,p0  +  Y  Wf+ufa  x  pj  +  <*0  x  (wj  x  pj) 

j— i  ' 

+  a,(A;T +  2Wj  X  Aj' 


ThiQ  a  fiYArl  nr>int  of  tho  roMirciwo  formula  frtr  ^  ae  ehoiwn 

*  -  -  '  r-  -  ••  -  - - -  - r I  — • 


Pr 


<(^r„-.po  +  Y  x  P; + x  (w;  x  p*) 

+  SA*J +  2w,  X  Ajzj-xqjijj 
+  X  p‘  +  w,  X  (w,  X  p*) 


+  j?,  -+-  2w,  X 

A,7"?,-,  f  w,  X  p*  -f  w,  X  (w,  X  p*) 
+  9,(Aj +  2w,  XAj" 


where  p0  is  the  acceleration  of  the  base.  Typically  this  is  the  acceleration  due  to  gravity  at 
the  site.  If  one  took  u0  ^  0  above  then  p0  may  also  include  a  term  for  w0  x  (u>0  x  pj).  where 


P,‘  is  a  vector  from  the  Earth's  center  to  the  site;  this  accounts  for  the  centripetal 
acceleration  arising  from  the  rotation  of  the  Earth.  If  one  accounts  for  the  gravitational 
acceleration  ( g )  by  taking  p^  --  g  at  the  base,  else  p?  —  0  for  i  /-  0,  then  the  formula  may  be 
equivalently  re  written  with  greater  clarity  (covering  both  cases  u>0  —  0  and  uia  ^  0)  as 


p, 


i2  +  +  "j  x  P*  +  uj  x  K  x  p‘) 

+  +  2wj  X  Aj z,-xq} 


This  we  will  take  to  be  the  defining  closed-form  non  recursive  formula  for  p,.  The  term 
involving  pj  is  a  technical  artifice  to  account  for  p()  cleanly,  and  will  vanish  in  the  combining 
form. 


The  demonstration  of  the  combining  form  of  p„>(,  will  require  the  vector  identity  a  x  (6  x 
c)  -f  (6  x  a)  x  c  =  b  x  (a  x  c).  In  order  to  match  at  a  =  0  and  b  —  i,  take 


03 


P«>  -  E  Wj  +  tJ#!  +  Wu.,  X  p'  f  X  (w,,.,  X  p‘)  +  9;(Aj  2,-1^  i-  2w(llJ  X  A'  Zj-tqA 

j=a  '  ' 

=  H  X  p*  +  wu-J  X  X  pj) 

.7**“  ' 

+  +  2wq,;  X  A] 

+■  5Z  *^J+1  .i/pj  +  +  (WT-t-i.j1*'.,.*)  X  Wihi),;  +  +  X  pj 

J=A+1  \  '  ' 

+  (^2 +  u,lk-r').j)  X  ^(Wi  +  I.^a.k  +  <*,(M-I).j)  X  p^ 

+  Zj  —  iflj  -f  2(Wj  +  l  -4-w*  +  i,j)x 

b 

“  ^A-H.bPu.*  P(M-I),b  X  E  ^J+l.bPj 

+  x  f(w[+1.,wa.,)  x  £  ^7+1,.P; 

V  j=*+l 

■+  2  XI  (^Vi^im-d.j  X  p'j  +  ffjWj+i'^Ajzj- lqJ)\\ 
j  =  *-hv  // 

Pn.f  —  I  .bPa.Ir)  ~t~  P(*  +  )).b  "f"  X  ft|*-fl),(> 

/  N 

+  X  ^(W^,  ,,w„.a)  X  +  2(5|t+i) ,,  +  $M-i.b)J 

NEWTON-EULER  FORWARD  RECURSION  VARIABLES 

As  noted  in  the  text,  on  the  Newton-Euler  forward  recursion  the  coordinate  matrix 
products  of  interest  will  be  W„+i,a_m  instead  of  Wj+1 1. 

Noting  that  the  numbering  runs  backward,  we  see  that  /,  satisfies  the  non  recursive 


formula 


/,  =  £w,+1jf; 

;  — i 

n 

—  A1  +  i 
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T o  match  at  a  —  i  and  b  =  «, 


/o.fc  =  + 

A-  6 

=  53  Wa  +  i.JFJ  +  Wa  +  t.t+l  ^  tV*+2.,F, 

j»Q  J=*+l 

=  fa.k  +  Wa  +  )'k+lfl*+l).h 

If  desired,  forces  and  torques  applied  by  the  environment  to  the  manipulator  tip  may 
be  incorporated  in  a  fashion  similar  to  incorporating  the  acceleration  of  the  base  in  the 
discussion  of  p, This  will  not  change  the  displayed  combining  forms. 

Similarly,  n,  satisfies  the  non  recursive  formula 

«.  =  Yi  W‘  +  *-j(Nj  +  *j  x  ^  +  P*  x  (A, +  > /;  +  «)) 

~  Af|  +  a,  X  F,  +  P,  X 

+  A  +  i  +  p;  *  Mj-f 

=  TV,  -f  3,  X  F,  +p,  X  (A-m/p-h)  4- 

To  match  at  a  =  i  and  6  =  n  we  must  have 

f* 

n<i,6  ~  ^  VVa_^.  i  ,j  4“  ^  X  F^  4  Pj  X  (-Aj-f-  i/o-f 

j"*a 

—  i/V,^  X^+P,#  X  I  .A:  +  ^0+2.*  -H  /fc-f-  1  .b))^ 

j  — a  ^ 

b 

-h  VVa+j,*_f.a  53  W*4-2,/(ty  4  x  4  P,  X  (-Aj  +  l/u  +  l),b)J 

2-*  +  l 

—  ”<>,*  4  WoH-j,*+ir»*+i,6  +  5T  x 

j  *a  '  ' 

—  n0t*  4  ((-4*+lF?,,*)  X  /*+i,6  4-  «*  +  i,b) 
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Appendix  C  —  Derivation  of  the  Logarithmic  Time  Offsets  and  C 

These  may  be  derived  from  Table  V.1  by  inspection,  by  applying  the  rule  that  to  insure 
minimum  delay,  the  maximum  delay  of  any  variable  must  be  caused  by  a  data  dependency 
on  that  variable. 

Auaxl(uij,y)  >  Avatl(Wi'V) 

At>ati(w,,v)  >  AvaiHWj.y)  -f-  VC  +  VA 
Avatl(ujJ%y)  >  Avail[u>s  y)  -(-  VC  -f-  V A 
Avatl(Rj ,u)  >  Avail(Wj_y) 

Avail(Sr,u)  >  ^4t>at7(iyi.!()  -f-  VC  4-  VA 
At>a»l(S.,,ti)  >  Avail(uis  y)  -f  VC  +  VA 
Avail(S,.y)  -t-  MV'  >  Avail(R,.y)  4-  VC  4-  VA 
Avail(Qx.y)  >  AvaiHWj.y) 

Aval l(p ,  y)  >  i4vat/(IV,.y)  +  2VC  +  3VA 
Avail('ps  v)  >  Avail(uij  y)  4-2 VC  -f-  3VA 
Avait(pS  y)  >  Ava«7(wJ-v)  4-  VC  +  2VA 
Avail(pI  y)  +  MV  >  AvaW(ft,.y)  -f-  2VC  +  3VA 
Avail!#,  y)  +  MV  >  Avatl(S,,y)  +  VC  +  4VA  +  SV 
Availfr^y)  +WV>  Avail(QT,y)  +  VC  -f  4VA  +  SV 
Avail(pJ  V)  +MV>  Avatl(Wx,y)  +  2VC  4-  5VA  +  SV 

Avoiifn,.,,)  >  AvatUfx,y)  +  VC  +  VA 

The  delay  conditions  established,  actual  timing  can  be  generated.  From  Table  IV.1  we 
extract  the  o  =  6  case;  Table  V.1  covers  a  /  6. 

Avaxl{Wa,a)  =  0 
AvaiJ(wa_a)  -  MV 

Avail(  ua.a)  =  MV  (*) 

Avmt(Qa,a)  =  MV 
Avail(J?0i„)  =  0 

Avail(Sa,a)  =  MV  +  VC.  (*) 

However,  the  equations  marked  (*)  fail  to  satisfy  the  delay  conditions  and  so  must  be  revised 
to 

Avo»7(w0,0)  =  MV  +  VC  +  VA 
Avo«7(S0.0)  =  M  V  +  VC  -I-  VA. 
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(*) 

(/rom  (*)  above) 


Analysis  of  Avatl(paa)  is  less  obvious,  but  proceeds  as  follows  (assuming  VC  >  SV  and 
VC  >  VA) 


Avail(pl)  =  0 
Avail^Alz,  qa)=MV 
Avatl('p9a  +  ffaAlza-i4a)  =  MV  -I-  VA 
Avail[ua.a  X  [ua.a  X  p*))  =  MV  +  2 VC 
Avat/(p®  -|-  da  -A  J  —  1 9,1  +  wa,a  X  (u>0,0  X  p*))  =  MV  +  2VC  +  VA 

Avail(Cja,a  X  p'a)  =  MV  +  VC 
Avatl(2uia.a  X  Qa.u)  =  MV  +  SV  -f  VC 
Avail(<jj,,.a  X  p'a  +  2 Ua.a  X  Qa.n)  =  MV  +  SV  +  VC  +  VA 


Avail(p,.  n)  =  MV  +  2VC  +  2VA 
Ava«l(poa)  =  MV  +  2  VC  +  3VA 


n 


where  the  last  line  is  added  so  that  pau  satisfies  the  delay  conditions  (assuming  MV  > 
SV  +  2VA). 

Since  these  satisfy  the  minimum  delay  conditions,  propagation  occurs  at  a  rate  of 
[MV  -t  VA)  per  node.  It  can  readily  be  seen  that,  in  general, 


Avail[Xa,b)  =  Avatl[Xa,a)  +  ri0B2(fe  -  O  +  1)](MV  +  VA). 


Thus  in  particular,  if  X,  is  the  linear  recursive  variable  corresponding  to  Xa,b ,  then  X,  =  X0., 


so 


Avatl[X,)  =  Avail(Xo,i) 

-  Avaxl[Xa,a)  +  [log2(.  +  1)1(MV  +  VA). 


Hence, 


Avat{(uio.i)  =  Avail(  w0,0)  +  [log2(»  +  1)KMV  +  VA) 

Ava»/((j0>l)  —  Avail[ua,a)  +  flog2(i  +  1)](MV  +  VA) 

Avatl(p 0>l)  =  Avail(pa  a)  -f  flog2(»  +  l)l(Af  V  +  VA) 

Avai/('r0,,)  =  max( AvoW(p0  l),  Ava»Z(w0il)  -f-  VC  +  VA,  Avat7(wo,t)  +  2VC  +  VA)  +  VA 
Avatl[Fo,,)  —  Avai7(r0ll)  -f  SV 

=  MV  +  SV  +  2  VC  +  4VA  +  [log2(.  +  lJKA/V  +  VA) 

Avail(N0,,)  =  max(Avai<(u.'o,1)  -4-  VC,  Ava«7(w0,i))  +  MV  +  VA 


-  2MV  +  VC  +  2VA  +  flog2(«  -f  1)1(MV  +  VA) 


107 


Thereafter,  on  the  forward  recursion  (from  Table  IV.  1), 


Avail(f,ln)  =  Avail(F0„) 

=  MV  +  5V  +  1VC  +  A VA  +  [log.,(n  +  1)1(MV  -r  VA) 

Avatl(n„'„)  max(Avai/(F0,n)  4-  VC,Avail(N0, „))  +  VA 
=  A/V  +  VC  +  3VA 

+  max(Af  V,  SV  +  2VC  +  2VA)  +  [log2(n  f  1)] (MV  +  VA) 

which  satisfies  the  delay  conditions.  Propagation  therefore  occurs  at  the  maximum  rate  and 

Avad{nn,0)  =  MV  4  VC  +  3VA 

4-  max(M  V,  SV  4-  2 VC  4-  2VA)  4-  2flog2(n  -r  l)l(Af  V  4-  VA) 
AvaW(r0)  =  MV  4-  VC  4-  3VA 

4-  max(A/V,  SV  +  2VC  4-  2VA)  -4-  2[log2(n  4-  1  )}{MV  4-  VA) 

=  2[iog2(n  4-  1)1(MV  4-  VA)  4-  MV  +  SV  +  3 VC  4-  5VA 
=  2,rlog2(n  4-  1)](M  V  4-  VA)  4-  5  Mutts  4-  10  Addns 
>  Avaxlij i) 

assuming  again  a  maximally  parallel  system. 


108 


Appendix  D  —  Unification  of  Logarithmic  o  —  6  and  a  /  b  Cases 

By  the  following  technical  artifice  we  can  make  the  o  =  6  case  look  like  a  ^  b 

Wa,k  =  I 
W/k-H.k  = 

=  &a*a  —  1  9a 

uk  +  \,b  —  0 

&aZa  —  l'9a 
W/c+l,b  =  o 

Qa.k  =  ^a*a~l9e 

Qk+l.b  =  0 

Ra,k  =  0 

Rk  +  \,b  —  Pa 

Sa.k  =  0 
S*  +  l,6  =  0 

Pa,*  =  ®oOa  —  l?a 
Pk  +  l.b  —  Pa 

Qa,„  is  substituted  for  Q*+i,b  in  p0  0 

Wa+t,*+i  =  I 
W*+2,(,+i  =  A,+1 
fa,k  =  Fa 
/*-M  h  =  0 

n<j,*  ==  Na 

n*+i,i>  =  s*  X  Fa- 

Now  following  two  applications  of  the  (n/ 2)  processor  nodes  to  the  n  groups  of  input 
data  we  have  Xa,a  as  required,  and  similarly  on  the  forward  recursion. 
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